performance bottleneck - Githubissues

jpuritz / dDocent

a bash pipeline for RAD sequencing

MIT License

52 stars 41 forks

It just occurred to me that https://github.com/jpuritz/dDocent/blob/3615f8e2a405eb35595b77c60955c0a3dbfe15e1/dDocent#L364 is a performance bottleneck because the mapping process all but halts to sort the single bam file before finishing the iteration. Would you be interested in me modifying this/these loops to democratize it a bit so the sorting exploits the number of threads more? My thinking is that the options are:

1. add more available threads for samtools sort in the loop body
2. move the sorting outside the loop body and into a parallel call

As it is now, the mapping part of dDocent maximizes threading per individual rather than across individuals (I'm a fan, as it quickens time-till-first-error), so my guess is that the same strategy can be applied to the samtools sort call, or the sort jobs can simply be split across threads.
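To make the "split sort jobs across threads" idea concrete, here is a minimal sketch of how the thread budget could be divided. `NUMProc` and `SAMProc` are the variable names used in the snippet further down; the helper name `sort_jobs` is hypothetical and not part of dDocent.

```shell
#!/usr/bin/env bash
# sort_jobs TOTAL_THREADS THREADS_PER_SORT   (hypothetical helper)
# Prints how many samtools sort jobs can run at once without
# oversubscribing TOTAL_THREADS. Always prints at least 1.
sort_jobs() {
    total="$1"
    per_job="$2"
    # Guard against a zero or negative per-job thread count.
    if [ "$per_job" -lt 1 ]; then
        per_job=1
    fi
    jobs=$(( total / per_job ))
    # Integer division can yield 0 when per_job > total; floor at 1.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

sort_jobs 16 4   # prints 4: four concurrent sorts with 4 threads each
sort_jobs 2 4    # prints 1: falls back to a single job
```

The floor of 1 matters if the result is passed to GNU parallel, since `--jobs 0` there means "run as many jobs as possible" rather than "run none".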

Do you have any thoughts on this?

It's not the end of the world, but I tested the second option (moving the samtools sort out of the loop body) and got a bit of a speedup working with 10 files:

dDocent
29577.63s user 583.83s system 682% cpu 1:13:37.58 total

dDocent_mod
28304.44s user 569.33s system 765% cpu 1:02:52.20 total
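For anyone who wants to put a number on the difference, here is a small sketch that converts the two wall-clock totals above to seconds and compares them. The helper name `to_seconds` is hypothetical; the times are the ones reported above.

```shell
#!/usr/bin/env bash
# to_seconds H:MM:SS.ss   (hypothetical helper)
# Converts a wall-clock time as printed by `time` into seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 3600 + $2 * 60 + $3 }'
}

before=$(to_seconds 1:13:37.58)   # dDocent
after=$(to_seconds 1:02:52.20)    # dDocent_mod

# Report the absolute and relative reduction in wall-clock time.
awk -v b="$before" -v a="$after" 'BEGIN {
    printf "saved %.2f s (%.1f%% less wall-clock time)\n", b - a, (b - a) / b * 100
}'
# prints: saved 645.38 s (14.6% less wall-clock time)
```

That is a single run on 10 files, so it should be read as a rough indication rather than a benchmark.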

# sort
find ./ -name "*.bam" | parallel --jobs $(( $NUMProc / $SAMProc )) samtools sort -@$SAMProc {} -o {} "2>>" {}.log
# rename to -RG.bam. The parallel {.} replacement string removes the last extension from the input filename
find . -name "*.bam" | parallel --jobs 1 mv {} {.}-RG.bam
# index output bam files
find ./ -name "*-RG.bam" | parallel --jobs $NUMProc samtools index {}
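A possible variant of the snippet above, sketched under some assumptions: instead of sorting each file in place and renaming it afterwards, each sort writes straight to its `-RG.bam` name and is indexed as soon as it finishes. This is not what dDocent or the snippet above does (it leaves the unsorted input on disk), and the function names `rg_name` and `sort_and_index` are hypothetical.

```shell
#!/usr/bin/env bash
# rg_name FILE.bam   (hypothetical helper)
# Derives the "-RG.bam" output name, mirroring what parallel's {.}
# replacement string does (drop the last extension, then append).
rg_name() {
    printf '%s-RG.bam\n' "${1%.bam}"
}

# sort_and_index FILE.bam THREADS   (hypothetical helper)
# Sorts one BAM into its -RG.bam name, appending stderr to FILE.bam.log,
# and indexes the result only if the sort succeeded.
sort_and_index() {
    bam="$1"
    threads="$2"
    out="$(rg_name "$bam")"
    samtools sort -@ "$threads" -o "$out" "$bam" 2>> "$bam.log" &&
        samtools index "$out"
}

# Usage sketch (assumes GNU parallel, samtools, and the NUMProc/SAMProc
# variables from the snippet above; not executed here):
#   export -f rg_name sort_and_index
#   find . -name "*.bam" ! -name "*-RG.bam" |
#       parallel --jobs $(( NUMProc / SAMProc )) sort_and_index {} "$SAMProc"
```

Excluding `*-RG.bam` in the `find` call keeps a rerun from picking up files that were already sorted.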

If you're cool with this, I'd like to submit a PR for this and #68, though the latter will need bedops added as a dependency to install_ddocent and the conda recipe.

performance bottleneck #72