nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

log output and other #4

Closed DelphIONe closed 6 years ago

DelphIONe commented 6 years ago

Hello,

I used Nanoraw before, and with Nanoraw the Events table had to be present in Analyses/Basecall_1D_000/BaseCalled_template/, right? I found that logical. With tombo, it is not necessary: tombo seems to work with only the Signal table present in Raw/Reads/Read_x/. Can you explain the link between the Signal and Events tables, please? I don't understand how tombo can do a "resquiggle" without the Events table.

I have tried tombo on lambda phage. I obtain this result:

tombo resquiggle --graphmap-executable ~/softs/graphmap/bin/Linux-x64/graphmap --processes 8 --align-processes 4 --overwrite /data1/171130_mI_run46_1d/workspace/pass/ ../lambda/lambda.fasta
Getting file list.
Correcting 136999 files with ...
Failed reads summary (14174 total failed):
    Alignment not produced (if all reads failed check for index files) : 554
    No valid path found through raw signal of long read : 2693
    No valid path found through raw signal of short read : 1879
    No valid path found through start of raw signal : 3552
    Read event to sequence alignment extends beyond --bandwidth : 1904
    Too little signal around event-aligned genomic deletion : 3592

Can you explain what "No valid path found through raw signal of long read" means? What is a long read, and what is a short read? With standalone graphmap on the fastq, I obtain a mapping rate of 99.6%! Why is this percentage so much lower with tombo, even though the chosen mapper is graphmap?

A last question: will tombo replace nanoraw, or are there features supported by one but not by the other?

Thanks for your reply, Delphine

marcus1487 commented 6 years ago

For your first question, the signal table holds just the raw signal (current measurements taken at regular intervals). The events table annotates the locations within this raw signal assigned to base calls (along with some additional summary information for each event). The new tombo resquiggle algorithm does not use the events table; instead it calls new events from the raw signal (using a slightly different approach than albacore) and then uses a sequence-based expected signal level model to assign the basecalls to the new events. This does mean that tombo only works on R9.4/5 flowcells for now, as this is the only model currently provided. I am working on updated documentation to explain some of these details, so that should help with interpretation.

Thus the "no valid path" failed reads are reads where the score of the model-based event-to-sequence matching was not high enough to indicate a valid match.
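To make the link between the Signal and Events tables concrete, here is a minimal sketch of how an events-like table can be derived from raw signal once segment boundaries are known. This is only an illustration, not tombo's or albacore's event detection: the function name `summarize_events`, the toy signal values, and the hand-picked boundaries are all assumptions made for the example.

```python
from statistics import mean


def summarize_events(signal, boundaries):
    """Summarize raw signal into per-event records (hypothetical helper).

    signal:     raw current samples, as stored in the Signal table
    boundaries: sorted sample indices where each event starts,
                followed by the index one past the last sample
    """
    events = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = signal[start:end]
        events.append(
            {"start": start, "length": end - start, "mean": mean(segment)}
        )
    return events


# Toy data: three plateaus of current, with boundaries chosen by hand.
signal = [80, 81, 79, 100, 101, 99, 102, 60, 61]
boundaries = [0, 3, 7, 9]
events = summarize_events(signal, boundaries)
for event in events:
    print(event)
```

In this picture, an events table is just a set of annotations over the raw signal, which is why a tool that finds its own boundaries only needs the Signal table.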

The long and short read categories come from details of the new re-squiggle algorithm. Right now, a short read should be anything below 600 bp. Again, the updated documentation should help here as well.

For the mapping details, these reads are failing at the signal alignment stage and not the sequence alignment stage. You will note that only 554 of your reads did not produce an alignment. The rest failed at some downstream signal-based portion of the algorithm. I will add details for these failed read reports to the documentation as well.
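The arithmetic behind this can be checked directly from the numbers in the summary above. The sketch below only reuses the counts reported in the original post; the variable names are illustrative, and it assumes the "Correcting 136999 files" figure is the total number of reads attempted.

```python
# Counts copied from the "Failed reads summary" in the original post.
total_reads = 136999
failed = {
    "Alignment not produced": 554,
    "No valid path found through raw signal of long read": 2693,
    "No valid path found through raw signal of short read": 1879,
    "No valid path found through start of raw signal": 3552,
    "Read event to sequence alignment extends beyond --bandwidth": 1904,
    "Too little signal around event-aligned genomic deletion": 3592,
}

total_failed = sum(failed.values())
alignment_failed = failed["Alignment not produced"]
signal_failed = total_failed - alignment_failed

pct_alignment_failed = 100 * alignment_failed / total_reads
pct_total_failed = 100 * total_failed / total_reads

print(f"Total failed:            {total_failed} ({pct_total_failed:.2f}%)")
print(f"Failed sequence mapping: {alignment_failed} ({pct_alignment_failed:.2f}%)")
print(f"Failed at signal stage:  {signal_failed}")
```

About 0.4% of reads fail to produce an alignment, which is roughly in line with the 99.6% mapping rate seen with standalone graphmap; the remaining failures (about 10% of reads overall) occur later, at the signal level.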

For the future of nanoraw, yes, tombo is intended to replace nanoraw. A deprecation notice will be added to nanoraw soon. The old re-squiggle algorithm (model-unaware), with some updates to handle new event formats, is available in tombo via the event_resquiggle command. The major issue with this algorithm now is that albacore has switched to raw basecalling, so the event boundaries no longer correspond to the exact start of a base. This causes major issues with signal assignment, affecting many downstream processing commands. This was one of the major motivators for the new re-squiggle algorithm (without the events table and using a model).

I hope this answered all of your questions and thank you for your interest in this software!

DelphIONe commented 6 years ago

Thanks for your quick reply! I have observed reads with status "success" whose alignment attributes for matches, deletions, insertions and mismatches are all equal to zero. For example:

mapped_start = 3
mapped_strand = -
num_deletions = 0
num_insertions = 0
num_matches = 0
num_mismatches = 0

Is it a little bug?

marcus1487 commented 6 years ago

Yes, I have stopped recording these statistics in the FAST5 (and just set to 0) due to some re-factoring of the code for the new re-squiggle algorithm. I will try to add that functionality back shortly.

rasto2211 commented 6 years ago

+1 for adding number of matches, deletions, insertions and mismatches back to fast5.

rasto2211 commented 6 years ago

Maybe tombo resquiggle could have a flag like --keep-alignment-to-genome which would tell tombo to store the alignment of the read to genome in the fast5 as nanoraw used to do.

marcus1487 commented 6 years ago

The real issue is that when I switched to the new event-less re-squiggle method, I no longer needed the base called sequence after the mapping is completed and so I stripped out the query sequence parsing from the sam record parser. I could very quickly add the indels back, but need to add the query sequence parser in order to correctly annotate with matches and mismatches (and I figured it would be very confusing to just leave mismatches as 0 or the like). This isn't too much work; I just haven't found the time yet. Given the interest I will try to squeeze it into the next release.
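To illustrate why indels are cheap to recover but matches and mismatches need the query sequence, here is a rough sketch, not tombo's SAM parser. The function `alignment_stats` and the toy sequences are hypothetical; the only thing taken as given is the standard SAM CIGAR convention, in which an `M` operation covers both matches and mismatches.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(cigar, query, reference):
    """Count matches, mismatches, insertions and deletions (hypothetical helper).

    query:     read sequence as in the SAM SEQ field (soft clips included)
    reference: reference sequence starting at the first aligned position
    """
    stats = {"matches": 0, "mismatches": 0, "insertions": 0, "deletions": 0}
    q = r = 0  # current positions in query and reference
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            # "M" does not say whether bases agree, so compare them directly.
            for i in range(n):
                if query[q + i] == reference[r + i]:
                    stats["matches"] += 1
                else:
                    stats["mismatches"] += 1
            q += n
            r += n
        elif op == "I":
            stats["insertions"] += n
            q += n
        elif op == "D":
            stats["deletions"] += n
            r += n
        elif op == "S":
            q += n  # soft-clipped bases consume query only
        elif op == "N":
            r += n  # skipped region consumes reference only
        # "H" and "P" consume neither sequence.
    return stats


# Toy example: 3M 1I 2M 1D 2M with one mismatch in the second M block.
stats = alignment_stats("3M1I2M1D2M", "ACGTTAGG", "ACGTCAGG")
print(stats)
```

The insertion and deletion counts fall out of the CIGAR string alone, while the match and mismatch counts require walking both sequences.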

marcus1487 commented 6 years ago

I wanted to confirm here that mapping stats are included in tombo as of version 1.1.

yuxinPenny commented 1 year ago

Hi, which tombo version are you using? I want to use Graphmap as the aligner, but it seems Tombo 1.5.1 does not support it.