skovaka / UNCALLED

Raw nanopore signal mapper that enables real-time targeted sequencing
MIT License

Generating sequencing summary from fast5 raw reads #47

Open maximilianmordig opened 2 years ago

maximilianmordig commented 2 years ago

Hi @skovaka, thank you for developing UNCALLED.

I am wondering how to generate the sequencing_summary.txt file that is needed to run the "uncalled sim" command as described in the README: /path/to/control/fast5s --ctl-seqsum /path/to/control/sequencing_summary.txt. These files don't seem to be provided. I have downloaded some E. coli fast5 raw reads, but unfortunately they don't come with a sequencing_summary.txt. To my understanding, the control fast5 files are only used to supply the raw fast5 signal for the simulation, so I am also wondering why the simulator relies on fields such as template_duration, which are basecaller-specific.

Thank you.

skovaka commented 2 years ago

We mainly use the sequencing summary to infer the timing between reads on each channel. This information is present in the fast5s as well, but parsing every fast5 file takes much longer than reading one text file. We also use the template start and duration to trim the adapter sequence and any noisy signal from each end of the reads. The ReadUntil API is able to do this in real time, and the sequencing summary was the best/easiest way I could find to mimic that behavior. So you are correct that it should be possible to simulate without a sequencing summary, but it would take some effort to work around those issues.

Some example sequencing summaries from human and a mock microbial community are available here: https://labshare.cshl.edu/shares/schatzlab/www-data/UNCALLED/simulator_files/
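
To make the two uses described above concrete, here is a minimal sketch (not UNCALLED's actual code) of how per-channel read timing and template trimming bounds could be derived from a sequencing summary. It assumes a tab-separated file with the Guppy-style columns channel, start_time, duration, template_start and template_duration, with all times in seconds, and a 4000 Hz sample rate. The function names read_gaps and template_bounds are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import defaultdict


def read_gaps(seqsum_file):
    """Return {channel: [gap_seconds, ...]} between consecutive reads.

    Assumes Guppy-style columns 'channel', 'start_time' and 'duration'.
    """
    by_channel = defaultdict(list)
    for row in csv.DictReader(seqsum_file, delimiter="\t"):
        start = float(row["start_time"])
        end = start + float(row["duration"])
        by_channel[int(row["channel"])].append((start, end))

    gaps = {}
    for channel, reads in by_channel.items():
        reads.sort()  # order reads on this channel by start time
        gaps[channel] = [nxt[0] - cur[1] for cur, nxt in zip(reads, reads[1:])]
    return gaps


def template_bounds(row, sample_rate=4000):
    """Return (start, end) sample indices of the template within a raw read.

    sample_rate is an assumption (4000 Hz); check the value for your run.
    """
    offset = float(row["template_start"]) - float(row["start_time"])
    start = round(offset * sample_rate)
    end = start + round(float(row["template_duration"]) * sample_rate)
    return start, end


# Tiny made-up summary, used only to exercise the functions above.
EXAMPLE = (
    "read_id\tchannel\tstart_time\tduration\ttemplate_start\ttemplate_duration\n"
    "r1\t1\t10.0\t5.0\t10.5\t4.25\n"
    "r2\t1\t16.0\t3.0\t16.25\t2.5\n"
    "r3\t2\t1.0\t2.0\t1.25\t1.5\n"
)

example_gaps = read_gaps(io.StringIO(EXAMPLE))
example_rows = list(csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t"))
example_bounds = template_bounds(example_rows[0])
```

The gaps give the idle time between reads on each channel, and the bounds mark the slice of raw signal that remains after dropping the adapter and noisy ends. Reproducing this without a sequencing summary would mean recovering the same start times and durations from each fast5 file.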