gt1 / biobambam2

Tools for early stage alignment file processing

bammarkduplicates2 - improving throughput #38

Open keiranmraine opened 7 years ago

keiranmraine commented 7 years ago

Hi,

I'm running biobambam2 on a high-CPU, high-RAM host and have been trying to improve throughput by modifying various parameters, but I'm not having much luck.

I'm assuming that to change the performance profile I need to adjust several parameters together. It's also not clear where in the process I should expect 'markthreads' to affect the timings in the log (or whether high values here are actually detrimental).

The process involves many input files.

Thanks

P.S. I found this thread from the biobambam repo (now gone), but I'm not sure if it's still relevant:

There are some things coming up, but they are not production ready yet. If you wanted to do the task with biobambam I would suggest to perform the following steps:

  • merge the files: bammerge level=0 in1.bam in2.bam ... in5.bam | bamrecompress numthreads=64 > merged.bam
  • run the duplicate marking: bammarkduplicates2 I=merged.bam O=marked.bam M=marked.metrics inputthreads=64 outputthreads=64 markthreads=64
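
For comparing thread settings, the two quoted steps could be wrapped in a small script so that every thread parameter changes together. This is only a sketch: the THREADS variable, the helper function names, and the file names are placeholders of mine, not part of biobambam2, and the commands are copied as written in the quoted thread without checking them against the current release. The script only prints the commands, it does not run them.

```shell
#!/usr/bin/env bash
# Sketch only: prints the two commands from the quoted thread, using one
# shared thread count so a parameter sweep changes every knob together.
# THREADS, merge_cmd, markdup_cmd and all file names are hypothetical.

THREADS="${THREADS:-8}"

# Step 1: uncompressed merge piped into a multi-threaded recompress.
# Input BAM files are passed positionally, exactly as in the quoted command;
# check the bammerge usage text for your version before relying on this.
merge_cmd() {
    echo "bammerge level=0 $* | bamrecompress numthreads=${THREADS} > merged.bam"
}

# Step 2: duplicate marking on the merged file, all three thread options
# set to the same value.
markdup_cmd() {
    echo "bammarkduplicates2 I=merged.bam O=marked.bam M=marked.metrics inputthreads=${THREADS} outputthreads=${THREADS} markthreads=${THREADS}"
}

# Dry run: show what would be executed for two placeholder inputs.
merge_cmd in1.bam in2.bam
markdup_cmd
```

Running it with different values, e.g. THREADS=16 and THREADS=64, and timing each printed command separately with `time` should show which step, if any, responds to the thread count.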