stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.
MIT License

feat: don't MarkDuplicates on DNA #138

Closed: a-frantz closed this 8 months ago

a-frantz commented 8 months ago

This started as a PR to replace Picard MarkDuplicates with samtools markdup. It turns out that samtools' implementation is prohibitively memory hungry: jobs still failed when allocated 3x the BAM size in memory. But while doing a deep dive on the samtools and Picard MarkDuplicates implementations, I found that the algorithms' distinction between optical and non-optical duplicates is built on shaky ground (IMO, at least). I don't think it's a sound analysis for us to be doing in QC. To my knowledge, we have never used the proportion of optical vs. non-optical duplicates during QC, so it should be safe to drop.

A nice side effect: this will save a chunk of change when running DNA samples through QC, since MarkDuplicates took a long time and needed a sizable memory allocation (50 GB).

Duplicate counts can still be obtained from the samtools flagstat output.
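
For anyone who wants to pull that number out downstream, here is a rough sketch (not part of this PR) of reading the duplicate count from the default text output of `samtools flagstat`. The function name `duplicate_counts` and the sample text are made up for illustration. It assumes the usual `N + M duplicates` line, where N is QC-passed and M is QC-failed reads. flagstat only counts reads that already carry the duplicate flag in the BAM.

```python
import re

# Matches the plain "N + M duplicates" line only; newer samtools versions also
# print an "N + M primary duplicates" line, which this pattern skips on purpose.
_DUP_LINE = re.compile(r"^(\d+) \+ (\d+) duplicates[ \t]*$", re.MULTILINE)


def duplicate_counts(flagstat_text):
    """Return (qc_passed, qc_failed) duplicate read counts from flagstat text.

    Hypothetical helper: expects the default text format of `samtools flagstat`.
    """
    match = _DUP_LINE.search(flagstat_text)
    if match is None:
        raise ValueError("no 'duplicates' line found in flagstat output")
    return int(match.group(1)), int(match.group(2))


# Illustrative input only, not real results from any sample.
EXAMPLE_FLAGSTAT = """\
100 + 5 in total (QC-passed reads + QC-failed reads)
90 + 5 primary
6 + 0 secondary
4 + 0 supplementary
12 + 1 duplicates
10 + 1 primary duplicates
95 + 3 mapped (95.00% : 60.00%)
"""

if __name__ == "__main__":
    passed, failed = duplicate_counts(EXAMPLE_FLAGSTAT)
    print(f"duplicates: {passed} QC-passed, {failed} QC-failed")
```

In practice the text would come from the flagstat file the QC workflow already writes, rather than from an inline string.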