Closed moqri closed 1 year ago
I think this is currently about as fast as it gets. After abismal, every tool is IO-bound, so most of the time is spent reading and writing intermediate files. I should point out that format_reads and duplicate-remover can print their results to stdout, so you can keep working with BAM files by piping to samtools view -b, as you did with abismal, and to samtools sort. At first glance, that is the only improvement I can see, and it helps storage efficiency rather than speed.
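As a rough illustration of the piping idea above, here is a minimal Python sketch that chains commands through OS pipes so that only the final output is written to disk. The helper name run_pipeline is hypothetical and is not part of abismal, format_reads, duplicate-remover, or samtools; the actual command lines for those tools are not shown here and should be taken from each tool's own help.

```python
import subprocess


def run_pipeline(commands, output_path):
    """Run `commands` (a list of argv lists) as one pipeline.

    Each command's stdout feeds the next command's stdin through an OS pipe,
    so no intermediate file is created. Only the last command's stdout is
    written to `output_path`. Hypothetical helper, for illustration only.
    """
    procs = []
    prev_stdout = None
    with open(output_path, "wb") as out:
        for i, cmd in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=prev_stdout,
                stdout=out if is_last else subprocess.PIPE,
            )
            if prev_stdout is not None:
                # Close our copy so the upstream process sees a broken pipe
                # if the downstream process exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        codes = [p.wait() for p in procs]

    # Report every step that exited with a non-zero status.
    failed = [(cmd, rc) for cmd, rc in zip(commands, codes) if rc != 0]
    if failed:
        raise RuntimeError(f"pipeline step(s) failed: {failed}")
```

Each entry in commands would be the argument list for one step, for example a tool that writes SAM to stdout followed by ["samtools", "view", "-b", "-"]. This saves the disk space and the write/read round trip of the intermediate file, but the steps themselves still run at the speed of the slowest one.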
Thanks, Gui. I'll keep working on it and will update here if I find ways to make it faster.
@moqri I'm going to close this. It's true that we should think about ways to eliminate intermediate steps and, as @guilhermesena1 mentioned, most steps after mapping are IO-bound. But for the most general use, people want access to those intermediate files and trust the external programs to do those jobs. We might bundle format with abismal, and in principle, after sorting, we could pipe duplicate-remover to methcounts. My current practice for optimizing these steps is to work back and forth between a pair of locally attached disks on the node I'm using, reading from one and writing to the other. I know that's not always available, but even making sure you are using a local disk can help. If you are using a distributed filesystem, make sure it is not the bottleneck because of poor configuration.
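One rough way to check whether a working directory is likely to be the bottleneck is to time a synced write to it and compare candidate locations. The sketch below is only an assumption about how one might do that; the function name and the default sizes are illustrative, and a single sequential write is a crude proxy for how these tools actually access the disk.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=256, block_mb=4):
    """Write about `size_mb` MB to a temp file in `directory`; return MB/s.

    Hypothetical helper for comparing a local disk with a shared filesystem.
    """
    block = os.urandom(block_mb * 1024 * 1024)
    n_blocks = max(1, size_mb // block_mb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            # Force the data to the device so the page cache does not
            # hide the real write cost.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the scratch file
    return (n_blocks * block_mb) / elapsed
```

Calling it once on a directory on the locally attached disk and once on a directory on the shared filesystem gives two numbers to compare. A large gap suggests moving the intermediate files to the local disk.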
Thanks for creating this great tool and for continuing to maintain it!
Below is my pipeline to get from FASTQ to methcounts. Any suggestions on ways to improve it, especially for speed? I have a 48-core machine with 120 GB of memory.