Closed: pdimens closed this 3 years ago
While not the end of the world, I tested the second option of moving the samtools sort
out of the loop body and got a modest speedup with 10 files:
```
dDocent:     29577.63s user  583.83s system  682% cpu  1:13:37.58 total
dDocent_mod: 28304.44s user  569.33s system  765% cpu  1:02:52.20 total
```
# sort
find ./ -name "*.bam" | parallel --jobs $(( $NUMProc / $SAMProc )) samtools sort -@$SAMProc {} -o {} "2>>" {}.log
# rename to -RG.bam. parallel's {.} placeholder strips the last extension from the input filename
find . -name "*.bam" | parallel --jobs 1 mv {} {.}-RG.bam
# index output bam files
find ./ -name "*-RG.bam" | parallel --jobs $NUMProc samtools index {}
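For what it's worth, the job arithmetic and the extension-stripping rename in the snippet above can be sanity-checked without samtools or any bam files; this is just a sketch, and the `NUMProc`/`SAMProc` values here are made up:

```shell
# Sketch of the scheduling math and filename handling used above.
NUMProc=16   # total CPU threads (illustrative value)
SAMProc=4    # threads handed to each samtools sort

# parallel --jobs $(( NUMProc / SAMProc )) runs this many sorts at once,
# each with SAMProc threads, so the machine stays fully loaded:
echo $(( NUMProc / SAMProc ))

# parallel's {.} strips the last extension, so the rename step turns
# sample1.bam into sample1-RG.bam; plain shell does the same thing:
f="sample1.bam"
echo "${f%.*}-RG.bam"
```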
If you're cool with this, I'd like to submit a PR for this and #68, though the latter will need bedops added as a dependency to install_ddocent and to the conda recipe.
It just occurred to me that https://github.com/jpuritz/dDocent/blob/3615f8e2a405eb35595b77c60955c0a3dbfe15e1/dDocent#L364 is a performance bottleneck: the mapping process all but halts to sort a single bam file before finishing each iteration. Would you be interested in me modifying this/these loops so the sorting better exploits the available threads? My thinking is that the options are either giving the in-loop `samtools sort` call more threads, or moving it out of the loop into a `parallel` call.

As it is now, the mapping part of dDocent maximizes threading per individual rather than across individuals (I'm a fan, as it shortens time-till-first-error), so my guess is that the same strategy could be applied to the `samtools sort` call, or the sort jobs could simply be split across threads. Do you have any thoughts on this?
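To make the two options concrete, here's a toy contrast of the scheduling strategies; `:` (the shell no-op) stands in for `samtools sort` so the sketch runs anywhere, and the filenames and thread counts are illustrative, not taken from dDocent:

```shell
NUMProc=8        # total CPU threads (illustrative)
SAMProc=2        # threads per sort job under option 2
files="ind1.bam ind2.bam ind3.bam ind4.bam"

# Option 1 (current layout): sort inside the mapping loop, one file at
# a time, giving all NUMProc threads to that single file.
for f in $files; do
    : samtools sort -@"$NUMProc" "$f"   # no-op placeholder
done

# Option 2 (proposed): defer sorting, then run NUMProc/SAMProc jobs
# concurrently with SAMProc threads each (GNU parallel in the snippet
# above; xargs -P shown here so the sketch needs no extra dependencies).
printf '%s\n' $files |
    xargs -P $(( NUMProc / SAMProc )) -I{} \
        sh -c ': samtools sort -@'"$SAMProc"' "$1"' _ {}
echo "jobs: $(( NUMProc / SAMProc )) x ${SAMProc} threads"
```

Either way the total thread budget is the same; the difference is whether it is spent on one file at a time or spread across files.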