nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Parallelization #150

Open martabaragli opened 3 months ago

martabaragli commented 3 months ago

Hi!

I have been using modkit for a while and noticed that parallelizing modkit extract doesn't really speed up the job. I have run it both on single BAM files output by guppy (one BAM per fast5) and on merged BAM files output by dorado (one for the entire sequencing run). Running modkit extract on the single BAMs, 100 at a time, takes about 3 hours for the full run, while running it on the merged BAM (same cumulative size) with 100 cores takes up to two days. Do you think the parallelization strategy implemented in modkit extract could be improved? Thank you! Marta

ArtRand commented 3 months ago

Hello @martabaragli,

Are the input mod-BAM files aligned and sorted (with an appropriate index available)? The parallelism in modkit extract is over genomic intervals, so if you have an unaligned mod-BAM the best it can do is read from the file with multiple threads. When you set the --threads flag, do you observe the program actually using that many resources?

For your use case, if you have numerous unaligned mod-BAMs, I would try running several modkit extract processes in parallel, one per file, each with only a few threads. I'll consider adding support for running over multiple files.
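
A minimal sketch of that per-file approach, in Python, may help illustrate it. It assumes the command takes an input BAM and an output path as positional arguments followed by `--threads` (check `modkit extract --help` for your version, since the layout may differ). The `bams` and `extracted` directory names, the helper names `build_command` and `run_all`, and the choice of 2 threads per job are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(bam, out_dir, threads=2):
    """Build one `modkit extract` call. The argument layout is an assumption."""
    bam = Path(bam)
    out_path = Path(out_dir) / (bam.stem + ".tsv")
    return ["modkit", "extract", str(bam), str(out_path), "--threads", str(threads)]


def run_all(bams, out_dir, total_cores=100, threads_per_job=2, runner=subprocess.run):
    """Run one job per file, keeping roughly total_cores threads busy overall."""
    max_jobs = max(1, total_cores // threads_per_job)
    commands = [build_command(b, out_dir, threads_per_job) for b in bams]
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # Each worker thread only blocks while its own modkit subprocess runs.
        return list(pool.map(lambda cmd: runner(cmd, check=True), commands))


if __name__ == "__main__":
    bam_files = sorted(Path("bams").glob("*.bam"))  # hypothetical input directory
    Path("extracted").mkdir(exist_ok=True)          # hypothetical output directory
    run_all(bam_files, "extracted")
```

With 100 cores and 2 threads per job, this keeps up to 50 files in flight at once. The right split between concurrent jobs and threads per job likely depends on disk throughput as much as on CPU count, so it is worth timing a few settings on a subset of files.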