nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

Consistent Failed Reads Summary #98

Closed by gkr2381 6 years ago

gkr2381 commented 6 years ago

I'm trying to understand what I can do about these errors:

$ tombo resquiggle /Volumes/Macintosh\ HD/Library/MinKNOW/data/reads/Reg_RNA_copy/fast5/pass/0/ /Volumes/Macintosh\ HD/Library/MinKNOW/mm10RNA_reference_genomes/mrna.fa --processes 4 --overwrite

[16:23:29] Loading minimap2 reference.

[16:23:41] Getting file list.

[16:23:41] Using default canonical ***** RNA ***** model.

[16:23:42] Re-squiggling reads (raw signal to genomic sequence alignment).

100%|█████████████████████████████████████████████████████████████████| 1772/1772 [00:38<00:00, 46.07it/s]

[16:24:20] Failed reads summary (1771 total failed):

    Alignment not produced :    1163

    Not enough raw signal around potential genomic deletion(s) :    2

    Poor raw to expected signal matching (revert with `tombo clear_filters`) :  584

    Read event to sequence alignment extends beyond --bandwidth :   20

    Read failed sequence-based signal re-scaling parameter estimation. :    2

I understand that the 'Failed reads summary' means that minimap2 wasn't able to map my reads to the reference genome. I've been struggling to get any libraries to align with any reference genomes. Initially, my thought was that something had gone awry in my RNA transcription process and/or library run, but given the number of libraries I’ve tried to use (made by me and other people) and the number of different reference genomes I’ve tried, I’m wondering if there’s another issue tucked away somewhere that I haven’t thought of. Tombo was updated before running these commands, so I don't believe the issue is related to version differences.

Does this perhaps have to do with the difference in pore models for RNA runs (-180 vs -200), as discussed in another GitHub issue? That issue thread seems to be the only place where I'm finding identical Failed reads summary errors.

Any advice would be appreciated.

Nastiya commented 6 years ago

Hi,

I'm having the same issue as well, but it seems that only part of my reads failed. Either way, I didn't get any output file. I'd also appreciate some input regarding this issue.

Best, Anastasia

My output :

(ana) ana:/$ tombo preprocess annotate_raw_with_fastqs --fast5-basedir /home/ana/Nanopore/fast5/0/ --fastq-filenames /home/ana/Nanopore/pass.fastq --processes 4

[12:31:03] Getting read filenames.

[12:31:03] Preparing reads and extracting read identifiers.

100%|███████████████████████████████████████████████████████████████| 883/883 [00:05<00:00, 176.46it/s]

[12:31:08] Annotating FAST5s with sequence from FASTQs.

90%|████████████████████████████████████████████████████████▊ | 797/883 [00:01<00:00, 701.24it/s]

[12:31:09] Added sequences to a total of 797 reads.

(ana) ana:/$ tombo resquiggle /home/ana/Nanopore/fast5/0/ /home/ana/Nanopore/ref.fasta --processes 4

[12:34:24] Loading minimap2 reference.

[12:34:24] Getting file list.

[12:34:24] Using default canonical DNA model.

[12:34:24] Re-squiggling reads (raw signal to genomic sequence alignment).

100%|████████████████████████████████████████████████████████████████| 883/883 [01:56<00:00, 7.58it/s]

[12:36:21] Failed reads summary (94 total failed):

    Alignment not produced : 6

    Fastq slot not present in --basecall-group : 86

    Not enough raw signal around potential genomic deletion(s) : 1

    Poor raw to expected signal matching (revert with tombo clear_filters) : 1

marcus1487 commented 6 years ago

@gkr2381, the Alignment not produced errors come purely from minimap2 (via the mappy Python API). They occur when mappy tries to align the basecalls stored in the Fastq slot of the fast5 file to the reference sequence and fails to produce a mapping result. This means that the Tombo models (or any Tombo parameters) would not affect these reads.

One important point is that these mappings do not allow for spliced mapping. Please see the discussion of the reasons for this in the Tombo documentation. If you are mapping potentially spliced reads to a genomic reference (which does not appear to be the case, since your command includes mrna.fa, but I thought it worth mentioning), this is almost certainly your problem.

A related potential issue here is that you have reads mapping to unannotated transcripts. You might try to map reads to the genome using a spliced mapper and see if a large fraction of reads map to regions without transcript annotations. This may include bits such as rRNA or other non-coding transcripts which are often not included in mrna.fa files.
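To check the mapping rate outside of Tombo, something like the following could be used. It is a minimal sketch, assuming the basecalls are available as a FASTQ file and that mappy is installed; the helper names (`mapped_fraction`, `mappy_fraction`) and the default `map-ont` preset are illustrative assumptions, not necessarily the settings Tombo uses internally. Passing `preset="splice"` with a genome reference would give the spliced comparison suggested above.

```python
def mapped_fraction(reads, has_hit):
    """Return the fraction of (name, seq) reads for which has_hit(seq) is true."""
    total = 0
    mapped = 0
    for _name, seq in reads:
        total += 1
        mapped += bool(has_hit(seq))
    return mapped / total if total else 0.0


def mappy_fraction(reference_fasta, fastq_path, preset="map-ont"):
    """Map every read in a FASTQ to a reference and report the mapped fraction.

    Hypothetical helper: the preset is an assumption and should be chosen to
    match the data (for example "splice" for a genomic reference).
    """
    import mappy  # third-party minimap2 bindings: pip install mappy

    aligner = mappy.Aligner(reference_fasta, preset=preset)
    if not aligner:
        raise RuntimeError("failed to load or build the minimap2 index")

    # fastx_read yields (name, sequence, quality) tuples.
    reads = ((name, seq) for name, seq, _qual in mappy.fastx_read(fastq_path))
    # A read counts as mapped if mappy returns at least one hit for it.
    return mapped_fraction(reads, lambda seq: any(True for _ in aligner.map(seq)))
```

Comparing the fraction against the transcript reference with the fraction against the genome in spliced mode should show whether many reads come from regions missing from mrna.fa.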

@Nastiya, the vast majority of your failed reads (86 of 94) fail with the Fastq slot not present in --basecall-group error, which indicates that these reads failed basecalling. Tombo cannot start processing a read without a basecalled sequence to map. But in your case close to 90% of your reads were processed successfully, which should be more than sufficient in most cases. It is expected that some reads will fail in most Tombo runs. This summary is simply intended to provide details on how those reads failed and to help identify issues with initial Tombo processing.
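As a rough illustration of reading this summary, the sketch below tallies the failure reasons from the resquiggle output quoted earlier in the thread. The parser and its names (`parse_failed_summary`, `SUMMARY`) are hypothetical helpers and not part of Tombo; the counts and the total of 883 reads are copied from that output.

```python
import re

# Failed-reads summary lines copied from the resquiggle output quoted above.
SUMMARY = """\
Alignment not produced : 6
Fastq slot not present in --basecall-group : 86
Not enough raw signal around potential genomic deletion(s) : 1
Poor raw to expected signal matching (revert with tombo clear_filters) : 1
"""


def parse_failed_summary(text):
    """Return {reason: count} for every 'reason : count' line in the text."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_failed_summary(SUMMARY)
total_reads = 883  # from the progress bar in the same output
failed = sum(counts.values())

# Print reasons from most to least frequent, with their share of all failures.
for reason, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:5d}  {n / failed:6.1%}  {reason}")

processed = total_reads - failed
print(f"processed: {processed}/{total_reads} ({processed / total_reads:.1%})")
```

Sorting by count makes the dominant failure mode obvious, which is usually the only one worth chasing first.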

As for no output file being produced, this is the expected behavior for the resquiggle command. This command is intended as a first processing step before a number of downstream Tombo commands for modified base detection and raw signal visualization. See the quick start and full documentation for several examples of downstream Tombo command pipelines.

Best of luck!

gkr2381 commented 6 years ago

Hi Marcus,

Is there a spliced mapper you can recommend that is equipped to work with MinION output?

marcus1487 commented 6 years ago

minimap2 is a very good spliced read mapper for nanopore data. This minimap2 mode is simply not supported in Tombo. As noted in the RNA section of the Tombo documentation, this is a conscious decision based on the goals of and practical considerations around raw signal mapping and subsequent detection of modified bases.

So for nanopore direct RNA data, if you are interested in base level analysis I would recommend the spliced mode for minimap2. If you are interested in signal level analyses, I would recommend generating a transcriptome reference first, and then applying Tombo from that starting point.
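For the base level route, a spliced minimap2 run could be assembled along these lines. This is a minimal sketch: the function name and the file names `genome.fa` and `reads.fastq` are hypothetical placeholders, and the `-ax splice -uf -k14` options are the combination the minimap2 documentation suggests for noisy nanopore direct RNA reads, so they are worth checking against the minimap2 version in use.

```python
import shlex


def minimap2_spliced_cmd(genome_fasta, reads_fastq, direct_rna=True):
    """Build a minimap2 command line for spliced alignment (SAM on stdout)."""
    cmd = ["minimap2", "-ax", "splice"]
    if direct_rna:
        # -uf: only consider the forward transcript strand
        # -k14: shorter k-mer for noisier reads
        cmd += ["-uf", "-k14"]
    cmd += [genome_fasta, reads_fastq]
    return cmd


# Hypothetical file names, for illustration only.
cmd = minimap2_spliced_cmd("genome.fa", "reads.fastq")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, stdout=...)` to capture the SAM output without going through a shell.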

gkr2381 commented 6 years ago

That makes sense. There was some confusion surrounding minimap2 mode support. I'm interested in signal level analyses, so I'll look into generating a transcriptome reference.

Thank you!