nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.
Other
230 stars 54 forks source link

tombo build_model estimate_alt_reference #309

Open esand93 opened 3 years ago

esand93 commented 3 years ago

Hello, we are trying to develop an alternative model for modified base Recognition with the tombo build_model estimate_alt_reference command, but we keep getting the error message

**** ERROR **** Too few minimal k-mer observations to continue to alternative estimation. Minimal k-mer has 0 total observations and 50 observations per k-mer are required.

Our Samples (Manufactured by PCR) Barcode04: plasmid canonical (4.8 kB) Barcode05: plasmid with ~25% modified base content (4.8kB)

How we processed our sample:

Step 1: Resquiggle normally: tombo resquiggle barcode04 ref_100.fasta --processes 4 --num-most-common-errors 5

Step 2: Build canonical model: tombo build_model estimate_reference --fast5-basedirs barcode04 --tombo-model-filename canonical-model

Step 3: Resquiggle with canonical model: tombo resquiggle barcode04 ref_100.fasta --processes 4 --num-most-common-errors 5 --overwrite --tombo-model-filename canonical-model tombo resquiggle barcode05 ref_100.fasta --processes 4 --num-most-common-errors 5 --overwrite --tombo-model-filename canonical-model

Step 4: Then we tried to build the model: tombo build_model estimate_alt_reference --alternate-model-filename modification_detection --alternate-model-name mod_base --alternate-model-base G --fast5-basedirs barcode05 --control-fast5-basedirs barcode04 --minimum-kmer-observations 50

Where we got the following error message: **** ERROR **** Too few minimal k-mer observations to continue to alternative estimation. Minimal k-mer has 0 total observations and 50 observations per k-mer are required.

We checked a variety of things to find the error:

  1. Tried building the model directly after Step 1 ->Same error Message

  2. Checked Coverage: Looks okay, (between 5000-9000) but is low at the terminal regions.

  3. To check if it’s the low terminal regions were causing this error message, we tested the directed modeling with the tombo build_model estimate_motif_alt_reference command. The bed file was directed to a modified base in the genome where we knew that the coverage was between 4000-6000. However, we got again an error message. tombo build_model estimate_motif_alt_reference \ --fast5-basedirs barcode05 \ --alternate-model-filename modification_detection \ --alternate-model-name mod --motif-description CGA:2 \ --valid-locations-filename modified_locations.bed \

Error:

At least one modified k-mer is not covered at any poitions by --minimum-test-reads.

Consider fitting to a smaller k-mer via the --upstream-bases and --downstream-bases, or lowering --minimum-test-reads.

  1. Tried resquiggling with a different kmer size (Adjusted kmer)
    1. Resquiggle normally. tombo resquiggle barcode04 ref.fasta --processes 4 --num-most-common- errors 5
    2. Build model with smaller kmer tombo build_model estimate_reference --fast5-basedirs barcode04 -- downstream-bases 1 --tombo-model-filename canonical-model2
    3. Resquiggle with this model tombo resquiggle barcode04 ref.fasta --processes 4 --num-most-common- errors 5 --overwrite --downstream-bases 1 --tombo-model-filename canonical-model2

-> Did not work, almost all reads (~90%) were discarded because of bad alignment.

Questions: -> as we understood, the default kmer for resquiggle is a 4mer. When we want to train a model with the "tombo build_model estimate_alt_reference" command, do all combinations of the 4mer have to be present in the sample? And is this why we get the error message? If so, is there any setting that would allow us to only train a model for the 4mers that are present in our sample? -> Is it likely that we do not have enough data (Coverage is too low) to build the alternative model or are we doing something else wrong?

->Is there a reasonable way to trim the terminal regions of the raw data files where the coverage is almost 0?

Thank you, and looking forward for your reply!

SycamoreLeaf commented 3 years ago

Hi @esand93. Welcome to Github!

I'm not an expert at Tombo but I'll try to help with your first question.

The error you referenced came from here in the source code: https://github.com/nanoporetech/tombo/blob/7473afe6e023bf37e7e1f0d7e5f973a1318b888a/tombo/tombo_stats.py#L1869-L1878

It looks from the neighboring code like Tombo wants a certain minimal coverage of each possible kmer of A, C, T, and G. https://github.com/nanoporetech/tombo/blob/7473afe6e023bf37e7e1f0d7e5f973a1318b888a/tombo/tombo_stats.py#L1820-L1822

So anyway, it appears from the code that Tombo requires every kmer to be present (with sufficient coverage) in your sample. It also appears there is no way to override this. (Again, I am not an expert on this.)

marcus1487 commented 3 years ago

I can confirm @Chris-Kimmel 's answer. Tombo models require that all values of the k-mer must be observed in order to produce a model. There are no workarounds for this as such a model would produce confusing results on processed data and a lot of case handling would have to be done to run such a model.

It is a bit odd that the canonical model was able to train. This indicates that the plasmid has enough sequence diversity containing all of the k-mers. Are there gaps in coverage in the modified sample or is this from a different sample?

Additionally, the 4-mer is the default for this command, but released models are trained using a 6-mer. A 3-mer is not likely to produce a useful model. This is likely the reason for the poor scores using this model (i.e. the 3-mer does not estimate the expected signal level well enough).

I hope this helps clear up the requirements for a Tombo model. Please post if there are specific questions about how to proceed with this analysis.

esand93 commented 3 years ago

Thank you for both of your answers!

It is a bit odd that the canonical model was able to train. This indicates that the plasmid has enough sequence diversity containing all of the k-mers. Are there gaps in coverage in the modified sample or is this from a different sample? --> we checked and calculated the 4-mers in the fasta file, and indeed each 4-mer is contained in that sequence (>6x) -->gaps of coverage exist for around the first 5-10 bases and the last 5-10 bases. But I guess that shouldn't be a big problem, if there aren't any unique kmers exactly at that position..?

This is my output when I try to build the model, Maybe it helps for troubleshooting. (We included more data for analysis, so the coverage is now around 30000) tombo build_model estimate_alt_reference --alternate-model-filename modification_detection --alternate-model-name mod_base --alternate-model-base G --fast5-basedirs barcode05 --control-fast5-basedirs barcode04

[11:10:03] Parsing Tombo index file(s). [11:10:03] Parsing Tombo index file(s). [11:10:03] Parsing standard model file. [11:10:03] Loading default canonical DNA model. [11:10:04] Parsing base levels from alternative reads.

Number of total reads used: 0%| | 0/28157 [00:00<?, ?it/s]

K-mer with fewest observations: 0%| | 0/1000 [00:00<?, ?it/s] Number of total reads used: 4%|â–Ž | 1000/28157 [01:05<29:37, 15.28it/s]

K-mer with fewest observations: 0%| | 0/1000 [01:05<?, ?it/s] Number of total reads used: 7%|â–‹ | 2000/28157 [02:06<27:29, 15.86it/s]

K-mer with fewest observations: 0%| | 0/1000 [02:06<?, ?it/s] Number of total reads used: 11%|â–ˆ | 3000/28157 [03:11<26:47, 15.65it/s]

K-mer with fewest observations: 0%| | 0/1000 [03:11<?, ?it/s] Number of total reads used: 14%|█▍ | 4000/28157 [04:14<25:34, 15.74it/s]

K-mer with fewest observations: 0%| | 0/1000 [04:14<?, ?it/s] Number of total reads used: 18%|█▊ | 5000/28157 [05:12<24:07, 16.00it/s]

K-mer with fewest observations: 0%| | 0/1000 [05:12<?, ?it/s] Number of total reads used: 21%|██▏ | 6000/28157 [06:10<22:48, 16.19it/s]

K-mer with fewest observations: 0%| | 0/1000 [06:10<?, ?it/s] Number of total reads used: 25%|██▍ | 7000/28157 [07:40<23:12, 15.20it/s]

K-mer with fewest observations: 0%| | 0/1000 [07:40<?, ?it/s] Number of total reads used: 28%|██▊ | 8000/28157 [08:48<22:11, 15.14it/s]

K-mer with fewest observations: 0%| | 0/1000 [08:48<?, ?it/s] Number of total reads used: 32%|███▏ | 9000/28157 [09:48<20:51, 15.30it/s]

K-mer with fewest observations: 0%| | 0/1000 [09:48<?, ?it/s] Number of total reads used: 36%|███▌ | 10000/28157 [10:46<19:34, 15.46it/s]

K-mer with fewest observations: 0%| | 0/1000 [10:46<?, ?it/s] Number of total reads used: 39%|███▉ | 11000/28157 [11:40<18:12, 15.70it/s]

K-mer with fewest observations: 0%| | 0/1000 [11:40<?, ?it/s] Number of total reads used: 43%|████▎ | 12000/28157 [12:40<17:03, 15.79it/s]

K-mer with fewest observations: 0%| | 0/1000 [12:40<?, ?it/s] Number of total reads used: 46%|████▌ | 13000/28157 [13:32<15:47, 16.00it/s]

K-mer with fewest observations: 0%| | 0/1000 [13:32<?, ?it/s] Number of total reads used: 50%|████▉ | 14000/28157 [14:31<14:41, 16.06it/s]

K-mer with fewest observations: 0%| | 0/1000 [14:31<?, ?it/s] Number of total reads used: 53%|█████▎ | 15000/28157 [15:29<13:35, 16.14it/s]

K-mer with fewest observations: 0%| | 0/1000 [15:29<?, ?it/s] Number of total reads used: 57%|█████▋ | 16000/28157 [16:23<12:27, 16.27it/s]

K-mer with fewest observations: 0%| | 0/1000 [16:23<?, ?it/s] Number of total reads used: 60%|██████ | 17000/28157 [17:17<11:20, 16.39it/s]

K-mer with fewest observations: 0%| | 0/1000 [17:17<?, ?it/s] Number of total reads used: 64%|██████▍ | 18000/28157 [18:14<10:17, 16.44it/s]

K-mer with fewest observations: 0%| | 0/1000 [18:14<?, ?it/s] Number of total reads used: 67%|██████▋ | 19000/28157 [19:09<09:14, 16.52it/s]

K-mer with fewest observations: 0%| | 0/1000 [19:09<?, ?it/s] Number of total reads used: 71%|███████ | 20000/28157 [20:05<08:11, 16.60it/s]

K-mer with fewest observations: 0%| | 0/1000 [20:05<?, ?it/s] Number of total reads used: 75%|███████▍ | 21000/28157 [20:57<07:08, 16.70it/s]

K-mer with fewest observations: 0%| | 0/1000 [20:57<?, ?it/s] Number of total reads used: 78%|███████▊ | 22000/28157 [21:48<06:06, 16.81it/s]

K-mer with fewest observations: 0%| | 0/1000 [21:48<?, ?it/s] Number of total reads used: 82%|████████▏ | 23000/28157 [22:47<05:06, 16.81it/s]

K-mer with fewest observations: 0%| | 0/1000 [22:47<?, ?it/s] Number of total reads used: 85%|████████▌ | 24000/28157 [23:55<04:08, 16.72it/s]

K-mer with fewest observations: 0%| | 0/1000 [23:55<?, ?it/s] Number of total reads used: 89%|████████▉ | 25000/28157 [25:04<03:10, 16.61it/s]

K-mer with fewest observations: 0%| | 0/1000 [25:04<?, ?it/s] Number of total reads used: 92%|█████████▏| 26000/28157 [26:12<02:10, 16.53it/s]

K-mer with fewest observations: 0%| | 0/1000 [26:12<?, ?it/s] Number of total reads used: 96%|█████████▌| 27000/28157 [27:21<01:10, 16.45it/s]

K-mer with fewest observations: 0%| | 0/1000 [27:21<?, ?it/s] Number of total reads used: 99%|█████████▉| 28000/28157 [28:32<00:09, 16.35it/s]

K-mer with fewest observations: 0%| | 0/1000 [28:32<?, ?it/s] Number of total reads used: 100%|██████████| 28157/28157 [28:45<00:00, 16.32it/s]

K-mer with fewest observations: 0%| | 0/1000 [28:45<?, ?it/s]

**** ERROR **** Too few minimal k-mer observations to continue to alternative estimation. Minimal k-mer has 0 total observations and 50 observations per k-mer are required.

Do you have any suggestion why we still get an error message saying we have too few k-mers? Could it also be that there is too little difference between the modified and control sample?

Thanks, and looking forward to your reply!

SycamoreLeaf commented 3 years ago

tl;dr Try specifying --tombo-model-filename canonical-model when you run tombo build_model estimate_alt_reference. Otherwise Tombo will use 6-mers.

In the code that counts k-mers, the value of k is determined by std_ref.kmer_width: https://github.com/nanoporetech/tombo/blob/7473afe6e023bf37e7e1f0d7e5f973a1318b888a/tombo/tombo_stats.py#L1820-L1822

Tracing it two levels up the stack, std_ref is a TomboModel imported from standard_ref_fn: https://github.com/nanoporetech/tombo/blob/7473afe6e023bf37e7e1f0d7e5f973a1318b888a/tombo/tombo_stats.py#L1948-L1959

The program is only using std_ref (1) to specify the value of k (as mentioned) and (2) to specify which nucleotide in the k-mer should be considered the "center" nucleotide.

Foregoing --tombo-model-filename in your shell command results in standard_ref_fn being None. That makes the TomboModel constructor default to Tombo's "production" model, which uses 6-mers. Thus your alternative model uses hexamers as well.

4800 random bases should be sufficient to represent every 4mer, but not to represent every 6-mer, hence your error.

Hope that helps, @esand93 . @marcus1487 will surely know if I'm imagining things.

marcus1487 commented 3 years ago

This is a great summation of the issue by @Chris-Kimmel . Not sure I have much to add. Let me know if you have any further issues with these commands.

esand93 commented 3 years ago

Adding "--tombo-model-filename canonical-model" to the command below solved the issue for building a model! Thank you so much for your help!

tombo build_model estimate_alt_reference --tombo-model-filename canonical-model --alternate-model-filename modification_detection --alternate-model-name mod_base --alternate-model-base G --fast5-basedirs barcode05 --control-fast5-basedirs barcode04

I then tried to continue to build a model for a specific motif but run into the same problem again which couldn't be solved the same way. When I run the following command I got the error message "At least one modified k-mer is not covered at any poitions by --minimum-test-reads."

tombo build_model estimate_motif_alt_reference --fast5-basedirs barcode05 --alternate-model-filename modification_detection --alternate-model-name mod --motif-description CAGT:3 --valid-locations-filename modified_locations.bed

Then I tried to add --tombo-model-filename canonical-model to the command, which resulted in

tombo build_model estimate_motif_alt_reference --tombo-model-filename canonical-model --fast5-basedirs barcode05 --alternate-model-filename modification_detection --alternate-model-name mod --motif-description CAGT:3 --valid-locations-filename modified_locations.bed

"tombo: error: unrecognized arguments: --tombo-model-filename canonical-model"

Is there a way how to solve this issue?

Thank you, and looking forward to your answer!

marcus1487 commented 3 years ago

The tombo build_model estimate_motif_alt_reference command does not take the --tombo-model-filename argument. Instead this command takes the --upstream-bases and --downstream-bases arguments as in the canonical model training command. It is left up to the user to match these values to the canonical model to be used.

I can't be certain why I did not have the canonical model as an argument here, but I'm guessing the canonical model may be used in the standard alternative model estimation step. It does seem like it would be good to have that same setting for this command.

Note that all tombo commands will print all possible arguments given the -h or --help flags. This should help resolve issues such as this in the future.

esand93 commented 3 years ago

When I run the estimate_reference command to train the canonical model, I didn't change the kmer values. As far as I understood, it will then use the default values (upstream-bases 1 and downstream-bases 2). When I run the estimate_motif_alt_reference command, I thought I won't need to specify the upstream and downstream bases, as the default values are the same as in the estimate_reference command. (upstream-bases 1 and downstream-bases 2)

I anyway tried to specify the upstream- and downstream-bases, but it didn't help. tombo build_model estimate_motif_alt_reference --fast5-basedirs barcode05 --alternate-model-filename modification_detection --alternate-model-name mod --motif-description CAGT:3 --valid-locations-filename modified_locations.bed --upstream-bases 1 --downstream-bases 2 --> At least one modified k-mer is not covered at any poitions by --minimum-test-reads

Strangely, it is able to build a model when I change upstream- and downstream-bases to: --upstream-bases 1 --downstream-bases 0 or --upstream-bases 0 --downstream-bases 1

However, then it's not matching the k-mer length of the canonical model, and will lead to errors later in detect_modifications command.

Do you have any idea, what is going wrong?

marcus1487 commented 3 years ago

This error is occurring since one of the k-mers contained within the --motif-description is not covered by the sample provided at the depth defined by --minimum-test-reads. You can try to lower the value of --minimum-test-reads. Tombo does not have a function to help determine at what level this will work, but the coverage output file from this sample may assist with this task.

esand93 commented 3 years ago

I tried to reduce the --minimum-test-reads to 1, but I still get the same error. I also checked the position that I try to model has a coverage of 16'000 (according to the genome_locations plot), so I think it should be more than enough to estimate a model for that position. Could it be that there is maybe something wrong with my bedfile? It looks simply like this: PlasmidName 3976 3977 0 0 -

Thanks for your help!

marcus1487 commented 3 years ago

Ah. The bed file is your issue. What is the use case that you are aiming for? I would certainly not recommend training a model from a single location. The model would be very biased to the local context (outside of the motif of interest) at that particular location.

The other issue is that the motif model does not model just the signal when that base is at the center of the motif, but also when the modified base of interest is in all offsets within the k-mer (restricted to the motif when using estimate_motif_alt_reference). Thus you need to observe the motif within all of the contexts outside of the fixed positions in order to train a model (or else the model would not have estimates for the signal when observing those motifs when the model is applied).

Hopefully this makes sense. Here are some more docs that might help: https://nanoporetech.github.io/tombo/modified_base_detection.html#specific-alternate-base-detection-recommended

esand93 commented 3 years ago

Sorry it might have been confusing, and I hope this is not getting more confusing now when I try to explain. I'm actually working with different set of samples. Set 1 (which I've been describing above): PCR samples with randomly inserted modifications. Set 2a: same plasmid as Set 1, same modification is inserted at position 3977 Set 2b: same plasmid as Set 1, different modification is inserted at position 3977 Set 2c: same plasmid as Set 1, another different modification is inserted at position 3977

I tried to a build model on Set 1 (tombo build_model estimate_alt_reference), and wanted to test it on Set 2a. But it recognized the modification poorly (I assume there is something wrong with my PCR samples). So then I just wanted to try out to build a model on my Set 2a samples around sequence 3977, and then test how well model recognizes the modification in Set 2a and how well it distinguishes from the other two modifications from Set 2b and 2c. So basically I won't really use this model for something useful, I was just curious to see how well it works. That's why I directed the bed file to that position. However, I built a model now with the tombo build_model estimate_alt_reference with my Set 2a samples and it worked quite well with detection of my modification and distinguishing from other modifications from 2b and 2c.

Anyway, I will repeat the PCR and hope I will get more modifications inserted, and then I will try to build a model again over the full plasmid and hopefully it will work better then!

Thanks for your comments and suggestions, it was really helpful!