Question about eventalign parallelization at file level

jts / nanopolish

Signal-level algorithms for MinION data

MIT License


Question about eventalign parallelization at file level #770

Open mmiladi opened 4 years ago

mmiladi commented 4 years ago

Hi,

Is it possible to speed up eventalign computations by splitting the files and/or by region windowing?

For example, to speed up nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa > all.tsv, split the fastq file and then run:

nanopolish eventalign --reads half1.fastq --bam all.bam --genome genome.fa > half1.tsv
nanopolish eventalign --reads half2.fastq --bam all.bam --genome genome.fa > half2.tsv
cat half1.tsv half2.tsv > all.tsv

Best,
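A side note on the final cat step: a minimal sketch, assuming every per-chunk TSV starts with the same one-line column header, so a plain cat would repeat that header in the middle of all.tsv. The helper name merge_tsv is hypothetical and not part of nanopolish.

```shell
# merge_tsv FILE...
# Concatenate TSV files, keeping only the first header line encountered.
# Assumption: every non-empty input starts with exactly one header line.
merge_tsv() {
    # NR == 1 : very first line read overall (the header we keep)
    # FNR > 1 : any non-header line of any file
    awk 'FNR > 1 || NR == 1' "$@"
}
```

Usage would be merge_tsv half1.tsv half2.tsv > all.tsv. Empty inputs are skipped naturally, since awk reads no lines from them.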

jts commented 4 years ago

Yes, that is the recommended way to speed it up.

Jared

On Apr 25, 2020, at 4:51 AM, Milad Miladi notifications@github.com wrote:

Hi,

Is it possible to speed up eventalign computations by splitting the files and/or by region windowing?

For example, to speed up nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa > all.tsv, split the fastq file and then run:

nanopolish eventalign --reads half1.fastq --bam all.bam --genome genome.fa > half1.tsv
nanopolish eventalign --reads half2.fastq --bam all.bam --genome genome.fa > half2.tsv
cat half1.tsv half2.tsv > all.tsv

Best,


mmiladi commented 4 years ago

Great, thanks. Would this also work with the window option '-w'? For the data I am using, -w seems to have no effect: I can see positions outside the requested range within the .tsv table.

jts commented 4 years ago

Sorry, I misread your issue initially (I shouldn't try to answer emails first thing in the morning...).

Splitting the fastq would work, but it isn't the recommended way: each run will still iterate over every read in the bam and then ignore the ones whose signal data it can't find. You should provide a coordinate range as the last argument (without -w though):

nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:0-1,000,000
nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:1,000,000-2,000,000
[...]
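The per-window commands above could be generated rather than typed by hand. What follows is only a sketch under stated assumptions: make_windows and eventalign_jobs are hypothetical helper names, the contig length is passed in by the caller, coordinates are printed without thousands separators, and adjacent windows share their boundary coordinate exactly as in the two commands above. The file names all.fastq, all.bam and genome.fa are taken from this thread, and the per-window output file names are made up for illustration. Nothing here settles how reads that span a window boundary are reported.

```shell
# make_windows CONTIG LENGTH WINDOW
# Print region strings CONTIG:START-END covering 0..LENGTH in WINDOW-sized steps.
make_windows() {
    contig=$1; length=$2; window=$3
    if [ "$window" -le 0 ]; then
        echo "window must be positive" >&2
        return 1
    fi
    start=0
    while [ "$start" -lt "$length" ]; do
        end=$((start + window))
        # Clamp the last window to the contig length.
        if [ "$end" -gt "$length" ]; then end=$length; fi
        echo "${contig}:${start}-${end}"
        start=$end
    done
}

# eventalign_jobs CONTIG LENGTH WINDOW
# Print (not run) one eventalign command line per window.
eventalign_jobs() {
    make_windows "$1" "$2" "$3" | while read -r region; do
        # Output name is illustrative: chrA:0-1000000 -> chrA_0_1000000.tsv
        out=$(printf '%s' "$region" | tr ':-' '__').tsv
        printf 'nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa %s > %s\n' "$region" "$out"
    done
}
```

For example, eventalign_jobs chrA 2500000 1000000 prints three command lines, which can be piped to sh to run one after another or handed to whatever job runner is available to run them concurrently.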

mmiladi commented 4 years ago

Thanks a lot for your prompt support. The coordinate-range hint will be a real life (time) saver :-)

mmiladi commented 4 years ago

Hi @jts ,

I have stumbled over the expected input of the eventalign range option. There are cases where the output tsv is empty and no errors are reported:

nanopolish eventalign --reads seq.fastq.gz --bam align.bam --genome ref.fa --samples --print-read-names --scale-events chr:21000-22000

[bam process] iterating over region:chr:21000-22000                                                                                                                

[post-run summary] total reads: 17556, unparseable: 0, qc fail: 2, could not calibrate: 0, no alignment: 1, bad fast5: 0

Here, I have spliced reads whose 5' end lies upstream of position 21000, but all of the reads fully cover the range 21000-22000. It seems, though I am not sure, that I only get the aligned events if the range start covers the 5' end of the read. Is this the expected behavior? Is there a way to parallelize over a region for all the reads that have (partial or complete) bases aligned to the region?

Best,
-M