nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

cDNA sequencing results in modification detection #225

Closed moldovannorbert closed 4 years ago

moldovannorbert commented 4 years ago

I am trying to verify the false positive rate of tombo v1.5. I ran tombo with the following commands. For the dRNA datasets: detect_modifications alternative_model --alternate-bases 5mC --rna --minimum-test-reads 5. For the cDNA dataset: detect_modifications alternative_model --alternate-bases 5mC --dna --minimum-test-reads 5.

In the dRNA sample, the raw fraction of modified Cs was 30% (10% with a 0.75 modified fraction cutoff), while the dampened fraction of modified Cs was 30% (1% with a 0.75 modified fraction cutoff).

I tested two more cases which I assumed would show moderate to no modification:

  1. I ran tombo on the enolase2 control shipped by ONT, which they describe as synthetic RNA (without any modifications). This showed a raw fraction of modified Cs of 98% (11% with a 0.75 modified fraction cutoff), and the dampened fraction gave the same percentages.
  2. I ran tombo on an amplified cDNA sample. This showed a raw fraction of modified Cs of 51% (8% with a 0.75 modified fraction cutoff).

The figure below shows the distribution of modified fractions for each of the above-mentioned samples. Panel A shows all the data, while Panel B shows only fractions > 0.1.

  1. Why does tombo detect so many modified bases in supposedly unmodified sequences (cDNA and synthetic RNA)?
  2. Can you give guidance on how one can decide whether a modification detected by the software is real, using only the alternative_model method?
marcus1487 commented 4 years ago

I'd first like to clarify the metrics presented here. When you indicate that "the raw fraction of modified Cs was 30%", is this a count of genomic locations where the fraction of modified reads is greater than 0%? This seems to be the case, as almost all of the samples appear to show very low fractions of modified bases in the plots presented.

I would suggest instead computing the fraction of modified bases on a per-read level. The total number of canonical and modified bases can be computed from the raw fraction and coverage values at each genomic position. Then a global fraction of modified bases could be computed and should be much lower, and hopefully more within expectations.
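As a rough illustration of the difference between the two metrics, here is a minimal Python sketch. It assumes the per-position modified fraction and coverage have already been parsed into (fraction, coverage) pairs; how those values are exported from tombo is not shown. The function names and the example numbers are hypothetical and are not part of tombo.

```python
from typing import Iterable, List, Tuple

# Each site is (modified_fraction, coverage) at one reference position.
Site = Tuple[float, int]


def site_level_fraction(sites: Iterable[Site], cutoff: float = 0.0) -> float:
    """Fraction of positions whose modified fraction exceeds the cutoff."""
    site_list: List[Site] = list(sites)
    if not site_list:
        return 0.0
    flagged = sum(1 for frac, _ in site_list if frac > cutoff)
    return flagged / len(site_list)


def global_modified_fraction(sites: Iterable[Site]) -> float:
    """Coverage-weighted fraction of modified base calls across all positions."""
    modified = 0.0
    total = 0
    for frac, cov in sites:
        modified += frac * cov  # estimated modified base calls at this position
        total += cov            # all base calls (canonical + modified)
    return modified / total if total else 0.0


# Hypothetical values for illustration only, not real tombo output.
example_sites = [(0.0, 20), (0.1, 10), (0.05, 20), (0.8, 5), (0.0, 45)]

print(f"sites with fraction > 0:    {site_level_fraction(example_sites):.2f}")
print(f"sites with fraction > 0.75: {site_level_fraction(example_sites, 0.75):.2f}")
print(f"global per-read fraction:   {global_modified_fraction(example_sites):.2f}")
```

In this made-up example, most positions with a non-zero fraction have only a small share of modified reads, so the site-level count is much higher than the coverage-weighted global fraction.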

That being said, the all-context models in tombo are somewhat problematic (especially in contexts with multiple copies of the same base) due to the model estimation procedure. This is one of the main reasons that modified base detection development has moved from tombo to megalodon. But megalodon does not perform all of the tasks that tombo does and thus tombo has not yet been completely deprecated.

The lowest false positive rates are expected for the context-specific models: dam, dcm and cpg. The all-context models show some false positives, and the de novo model generally shows the highest level of false positives. But each of these models has a trade-off in its capabilities, mostly in the contexts in which it can identify modified bases.

As an example of this, while the de novo method has a high false positive rate, it can be effectively used to identify bacterial modification motifs (see docs) with no other knowledge aside from the reference sequence.

Additionally, RNA signal is problematic for k-mer-level analyses, which results in the higher false positive modified base detection rates you observed. Again, this is part of the reason for shifting these capabilities from tombo into megalodon.

I hope this helps clarify the metrics you are seeing here.