This started as a PR to replace `picard MarkDuplicates` with `samtools markdup`. It turns out samtools' implementation is prohibitively memory hungry: I tested allocating 3x the BAM size in memory and still had failures. But while doing a deep dive on the samtools and picard MarkDuplicates (MD) implementations, I found that the algorithms' distinction between optical and non-optical duplicates is built on shaky ground (IMO at least). I don't think it's a sound analysis for us to be doing in QC. To my knowledge, we have never used the proportion of optical vs. non-optical duplicates during QC, so it should be safe to drop.
Cool side effect: this will save a chunk of change on running DNA samples through QC, since MD took a long time with a decent memory allocation (50 GB).
Duplicate counts are still available from the `samtools flagstat` output.
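For anyone who wants to pull that number out programmatically, here is a minimal sketch that reads the `duplicates` line from the default text output of `samtools flagstat`. It assumes the duplicate flag is already set on the reads in the BAM, since flagstat only counts reads that carry it. The function name `parse_duplicates` and the sample numbers are made up for illustration and are not part of this PR.

```python
import re

# Made-up sample in the default `samtools flagstat` text layout (abridged).
SAMPLE_FLAGSTAT = """\
1000 + 10 in total (QC-passed reads + QC-failed reads)
980 + 10 primary
20 + 0 secondary
0 + 0 supplementary
150 + 2 duplicates
145 + 2 primary duplicates
990 + 8 mapped (99.00% : 80.00%)
"""

# Matches e.g. "150 + 2 duplicates" but not "145 + 2 primary duplicates".
DUPLICATES_LINE = re.compile(r"^(\d+) \+ (\d+) duplicates$")


def parse_duplicates(flagstat_text):
    """Return (qc_passed, qc_failed) duplicate read counts from flagstat text."""
    for line in flagstat_text.splitlines():
        match = DUPLICATES_LINE.match(line.strip())
        if match:
            return int(match.group(1)), int(match.group(2))
    raise ValueError("no 'duplicates' line found in flagstat output")


if __name__ == "__main__":
    passed, failed = parse_duplicates(SAMPLE_FLAGSTAT)
    print(f"duplicates: {passed} QC-passed, {failed} QC-failed")
```

In practice the text would come from the flagstat file the QC run already writes, rather than from an inline string.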