jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

performance bottleneck #72

Closed pdimens closed 3 years ago

pdimens commented 3 years ago

It just occurred to me that https://github.com/jpuritz/dDocent/blob/3615f8e2a405eb35595b77c60955c0a3dbfe15e1/dDocent#L364 is a performance bottleneck: the mapping process all but halts to sort the single BAM file before the iteration finishes. Would you be interested in me modifying this loop (or these loops) to spread the work out a bit, so that the sorting makes better use of the available threads? My thinking is that the options are:

As it is now, the mapping part of dDocent maximizes threading per individual rather than across individuals (I'm a fan, as it shortens time-till-first-error), so my guess is that the same strategy can be applied to the samtools sort call, or the sort jobs can simply be split across threads.

Do you have any thoughts on this?

pdimens commented 3 years ago

The bottleneck isn't the end of the world, but I tested the second option, moving the samtools sort out of the loop body, and got a bit of a speedup working with 10 files (about 11 minutes less wall-clock time):

dDocent
29577.63s user 583.83s system 682% cpu 1:13:37.58 total

dDocent_mod
28304.44s user 569.33s system 765% cpu 1:02:52.20 total

# sort
find ./ -name "*.bam" | parallel --jobs $(( $NUMProc / $SAMProc )) samtools sort -@$SAMProc {} -o {} "2>>" {}.log
# rename to -RG.bam. The parallel {.} replacement string removes the last extension from the input filename
find . -name "*.bam" | parallel --jobs 1 mv {} {.}-RG.bam
# index output bam files
find ./ -name "*-RG.bam" | parallel --jobs $NUMProc samtools index {}

If you're cool with this, I'd like to submit a PR for this and #68, though the latter will need bedops added as a dependency to install_ddocent and the conda recipe.
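One detail about the sort step above: `$(( $NUMProc / $SAMProc ))` is integer division, so it evaluates to 0 whenever `SAMProc` is larger than `NUMProc`. As far as I know, GNU parallel treats `--jobs 0` as "run as many jobs as possible", which is probably not what is intended here. A minimal sketch of a guard follows. The `sort_jobs` helper is hypothetical (it is not part of dDocent), the thread counts are example values only, and it assumes `SAMProc` is at least 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dDocent): how many `samtools sort`
# jobs to run at once, given the total thread count and the threads
# handed to each sort. Assumes threads_per_sort >= 1.
sort_jobs() {
  local total_threads=$1
  local threads_per_sort=$2
  local jobs=$(( total_threads / threads_per_sort ))
  # Integer division gives 0 when threads_per_sort > total_threads;
  # clamp so that at least one sort job runs.
  if (( jobs < 1 )); then
    jobs=1
  fi
  echo "$jobs"
}

# Example values only; in dDocent these would be NUMProc and SAMProc.
NUMProc=8
SAMProc=3
echo "sort jobs: $(sort_jobs "$NUMProc" "$SAMProc")"
```

The result could then be passed to the sort line as `--jobs "$(sort_jobs "$NUMProc" "$SAMProc")"` in place of the inline arithmetic.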