Open martabaragli opened 3 months ago
Hello @martabaragli,
Are the input mod-BAM files aligned and sorted (with an appropriate index available)? The parallelism in modkit extract is over genomic intervals, so if you have an unaligned mod-BAM the best it can do is read from the file with multiple threads. When you set the --threads flag, do you observe the program actually using that many resources?
For your use case, if you have numerous unaligned mod-BAMs, I would try running modkit extract with a few threads per process, in parallel over all the files. I'll consider adding support for running over multiple files.
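As a rough sketch of that suggestion, something like the following Python wrapper could launch one modkit extract process per file, with a small thread count each and a cap on how many run at once. The function names, the output naming, and the job and thread counts are hypothetical. The positional argument layout (input BAM, then output path) is also an assumption and may differ between modkit versions, so check modkit extract --help before using it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(bam, out_dir, threads):
    """Build one `modkit extract` invocation for a single mod-BAM.

    Assumption: the CLI takes the input BAM and an output path as
    positional arguments plus --threads; the exact layout may differ
    between modkit versions.
    """
    out_path = Path(out_dir) / (Path(bam).stem + ".tsv")
    return ["modkit", "extract", str(bam), str(out_path),
            "--threads", str(threads)]


def run_all(bams, out_dir, jobs=25, threads=4, runner=subprocess.run):
    """Run one process per file, at most `jobs` at a time.

    Returns a list of (bam, return code) pairs in input order.
    `runner` is injectable so the scheduling can be tried without modkit.
    """
    def run_one(bam):
        result = runner(build_command(bam, out_dir, threads))
        return str(bam), result.returncode

    # Worker threads only wait on child processes, so a thread pool
    # is enough to cap the number of concurrent modkit jobs.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, bams))
```

Calling run_all(sorted(Path("bams").glob("*.bam")), "extracted") would process every BAM in a hypothetical bams directory, provided the output directory already exists. The defaults of 25 jobs with 4 threads each are only chosen to add up to the 100 cores mentioned in the question; any pair whose product matches the available cores is a reasonable starting point.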
Hi!
I have been using modkit for a while and I noticed that parallelizing modkit extract doesn't really speed up the job. Specifically, I have been using it both on single bam files output by guppy (one bam for each fast5) and on merged bam files output by dorado (for the entire sequencing run): when running modkit extract on single bams 100 at a time, the full run takes about 3 hours, while running modkit extract on the merged bam (same cumulative size) with 100 cores takes up to two days. Do you think maybe the parallelization strategy implemented in modkit extract could be improved?
Thank you! Marta