smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe

optimizing pipeline (especially speed) #201

Closed moqri closed 1 year ago

moqri commented 2 years ago

Thanks for creating this great tool and for continuing to maintain it!

Below is my pipeline to get from FASTQ to methcounts. Any suggestions on ways to improve it, especially for speed? I have a 48-core machine with 120 GB of memory.

# adapter and quality trimming
trim_galore --paired -q 0 --length 0 "$f"_1.fastq.gz "$f"_2.fastq.gz -j 48
# map with abismal, converting SAM to BAM on the fly
abismal -i $ind "$f"_1_val_1.fq.gz "$f"_2_val_2.fq.gz -t 48 -v | samtools view -b > "$f".bam
# convert abismal output to methpipe's read format
format_reads -f abismal "$f".bam -o "$f"_f.sam
# coordinate-sort before duplicate removal
samtools sort -O bam -o "$f"_fs.bam "$f"_f.sam -@ 16 -T tmp # -m 8G ?
# remove PCR duplicates
duplicate-remover -S "$f"_stat.txt "$f"_fs.bam "$f"_fsd.sam
# compute per-site methylation levels
methcounts -c $genome -o "$f".meth "$f"_fsd.sam -v -n
guilhermesena1 commented 2 years ago

I think this is currently about as fast as it gets. After abismal, every tool is I/O-bound, so most of the time is spent reading and writing intermediate files. I should point out that format_reads and duplicate-remover can print their results to stdout, so you can keep working with BAM files by piping to samtools view -b, as you already do for abismal and samtools sort. At first glance that's the most I can see in the way of improvements, and it helps storage efficiency more than speed.
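
For illustration, the middle of the pipeline could then look something like the sketch below. It is untested and assumes that format_reads and duplicate-remover write SAM to stdout when the output file is omitted, and that your methcounts build can read BAM directly (as duplicate-remover does in the pipeline above).

# untested sketch: stay in BAM between steps; assumes stdout output when
# the output file is omitted, and BAM input support in methcounts
format_reads -f abismal "$f".bam | samtools view -b > "$f"_f.bam
samtools sort -O bam -o "$f"_fs.bam "$f"_f.bam -@ 16 -T tmp
duplicate-remover -S "$f"_stat.txt "$f"_fs.bam | samtools view -b > "$f"_fsd.bam
methcounts -c $genome -o "$f".meth "$f"_fsd.bam -v -n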

moqri commented 2 years ago

Thanks, Gui. I'll keep working on it and will update here if I find ways to make it faster.

andrewdavidsmith commented 1 year ago

@moqri I'm going to close this. It's true that we should think about ways to eliminate intermediate steps, and, as @guilhermesena1 mentioned, most steps after mapping are I/O-bound. But for the most general use, people want access to those intermediate files and trust the external programs to do those jobs. We might bundle the format step with abismal, and in principle, after sorting, we could pipe duplicate-remover into methcounts. My current practice for optimizing these steps is to work back and forth between a pair of locally attached disks on the node I'm using. I know that's not always available, but even making sure you are on a local disk can help. If you are using a distributed filesystem, make sure it's not the bottleneck due to poor configuration.
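
For concreteness, that two-disk pattern might look like the sketch below, with each step reading from one disk and writing to the other so input and output never compete for the same device. The $d1 and $d2 mount points are hypothetical; substitute whatever local scratch space your node has.

# hypothetical two-local-disk layout: alternate input and output devices
d1=/disk1/scratch  # assumed local mounts; adjust to your node
d2=/disk2/scratch
abismal -i $ind "$f"_1_val_1.fq.gz "$f"_2_val_2.fq.gz -t 48 -v | samtools view -b > $d1/"$f".bam
format_reads -f abismal $d1/"$f".bam -o $d2/"$f"_f.sam
samtools sort -O bam -o $d1/"$f"_fs.bam $d2/"$f"_f.sam -@ 16 -T $d2/tmp
duplicate-remover -S "$f"_stat.txt $d1/"$f"_fs.bam $d2/"$f"_fsd.sam
methcounts -c $genome -o "$f".meth $d2/"$f"_fsd.sam -v -n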