nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

Some confusion about using Tombo to identify non-standard bases in both DNA and direct RNA sequencing data #128

Closed wlhCNU closed 5 years ago

wlhCNU commented 5 years ago

Hi marcus: I have a few questions about identifying modified bases from direct RNA nanopore data for the model species Arabidopsis thaliana, and I would appreciate your help.

  1. The log from my resquiggle run with default parameters is shown below. Almost two-thirds of the raw nanopore reads failed to pass resquiggle. What could cause this, and how can I clean my raw nanopore data to solve this kind of problem? In the log from the detect_modifications run (also shown below), I am unsure of the precise meaning of the progress bar (100%| 26716/26716 [55:53<00:00, 7.97it/s]): does the number 26716 represent the reads from the resquiggle analysis that detect_modifications processed successfully, or the number of iterations?

###################resquiggle log#############################################
[17:17:41] Loading minimap2 reference.
[17:17:46] Getting file list.
[17:26:35] Re-squiggling reads (raw signal to genomic sequence alignment).
100%|##########| 1369723/1369723 [38:57:51<00:00, 9.76it/s]
[08:24:26] Final unsuccessful reads summary (65.2% reads unsuccessfully processed; 893412 total reads):
    31.2% ( 426794 reads) : Base calls not found in FAST5 (see tombo preprocess)
    28.5% ( 389754 reads) : Alignment not produced
     5.4% (  73451 reads) : Poor raw to expected signal matching (revert with tombo filter clear_filters)
     0.2% (   3406 reads) : Read event to sequence alignment extends beyond bandwidth
     0.0% (      4 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
     0.0% (      2 reads) : Too much raw signal for mapped sequence
     0.0% (      1 reads) : Read failed sequence-based signal re-scaling parameter estimation.
[08:24:27] Saving Tombo reads index to file.

#################tomboDetect_modifications####################################
[09:32:28] Parsing Tombo index file(s).
[09:32:48] Performing alternative model testing.
[09:32:48] Performing specific alternate base(s) testing.
[09:32:48] Calculating read coverage regions.
[09:32:48] Calculating read coverage.
[09:32:58] Performing modified base detection across genomic regions.
100%|##########| 26716/26716 [55:53<00:00, 7.97it/s]
###########################################################################

  2. I saw in the Tombo GitHub that depth of coverage can affect modified base detection to some extent, and that this effect should be minimal above a certain level of coverage (probably >10-15X, but this has not been verified). Does this generally apply to both DNA and RNA nanopore data? Because the copy numbers of different transcripts vary in vivo, how can I determine whether the coverage is sufficient for RNA modified base analysis?

  3. What cutoffs for alternate fraction and depth of coverage are reasonable for treating an identified modified position as credible? For example, one identified RNA 5mC position (from the 'tombo plot most_significant' command) has these statistics: 0.33 for fraction and 1 for coverage. How can I judge whether this is a 5mC-modified position or not?
  4. Could you tell me the approximate accuracy of Tombo's different methods for identifying non-standard bases in both DNA and direct RNA sequencing data, especially Specific Alternate Base 5mC Detection on direct RNA sequencing data?

Thanks lihui

marcus1487 commented 5 years ago
  1. As noted in the copied re-squiggle output, most reads (~60% of all reads) fail because either base calls were not found in the FAST5 files or no alignment to the provided reference genome/transcriptome was produced.

The progress bar in the detect_modifications output counts genomic regions, not reads. Have a look at the multiprocessing help output from tombo detect_modifications alternative_model -h for more information.
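To make the "~60%" figure concrete, here is a minimal Python sketch that recomputes the percentages from the read counts in the resquiggle log quoted above. The counts are copied from that log; the variable and function names are only illustrative and are not part of Tombo.

```python
# Read counts copied from the resquiggle log in the question.
total_reads = 1369723
failures = {
    "Base calls not found in FAST5": 426794,
    "Alignment not produced": 389754,
    "Poor raw to expected signal matching": 73451,
    "Read event to sequence alignment extends beyond bandwidth": 3406,
    "Reference mapping contains non-canonical bases": 4,
    "Too much raw signal for mapped sequence": 2,
    "Read failed sequence-based signal re-scaling parameter estimation": 1,
}


def percent(count, total=total_reads):
    """Return count as a percentage of all reads passed to resquiggle."""
    return 100.0 * count / total


total_failed = sum(failures.values())
# The two categories named in the answer: missing base calls and no alignment.
top_two = failures["Base calls not found in FAST5"] + failures["Alignment not produced"]

for reason, count in failures.items():
    print(f"{percent(count):5.1f}% ({count:>7} reads) : {reason}")
print(f"All failures: {total_failed} reads, {percent(total_failed):.1f}% of all reads")
print(f"Base calls + alignment: {top_two} reads, {percent(top_two):.1f}% of all reads")
```

The two largest categories account for nearly all of the failures, so basecalling and the choice of reference are the places to look first, rather than signal-level filtering.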

  2. This recommendation was made primarily from DNA data using the alternative_model testing method. Modified RNA data is a bit trickier to validate as ground truths are much harder to find. In general more coverage is always better, but finding an acceptable level is generally application specific, so we don't have a global recommendation for all applications and/or tombo detection methods at this time.

  3. The 0.33 fraction with a coverage of 1 is due to the --coverage-dampen-counts default. This adds 2 unmodified pseudo-reads to each reference position, so a single modified read at coverage 1 gives 1 / (1 + 2) ≈ 0.33. This keeps lower-coverage regions from dominating the most significant sites from a run (as the fraction is an unstable measure at low coverage).

As with the previous question, this cutoff is application specific.
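The dampening arithmetic can be sketched in a few lines of Python. This is only an illustration of the description above (2 unmodified pseudo-reads added per position), not Tombo's actual implementation; the function name `dampened_fraction` and its parameters are hypothetical.

```python
def dampened_fraction(mod_reads, coverage, unmod_pseudo_reads=2):
    """Fraction of modified reads after adding unmodified pseudo-reads.

    Illustrative only: mirrors the behaviour described for the
    --coverage-dampen-counts default, not Tombo's source code.
    """
    return mod_reads / (coverage + unmod_pseudo_reads)


# The site from the question: 1 read of coverage, called as modified.
low_cov = dampened_fraction(mod_reads=1, coverage=1)
print(f"coverage 1, 1 modified read: raw 1.00, dampened {low_cov:.2f}")

# A hypothetical deeper site: the pseudo-reads barely change the estimate.
high_cov = dampened_fraction(mod_reads=50, coverage=100)
print(f"coverage 100, 50 modified reads: raw 0.50, dampened {high_cov:.2f}")
```

A reported fraction of 0.33 at coverage 1 therefore reflects a single read and carries very little evidence on its own, whereas at higher coverage the dampened and raw fractions are nearly the same.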

  4. As noted above, ground truth data sets for RNA are not as easy to create, so we don't have accuracy metrics for this method at this time.