Closed amilesj closed 6 years ago
This is a good idea, and is planned.
Do you actually get a lot of duplicates in bacterial genome sequencing? We don't.
The only issue is that it will slow Snippy down, as it needs to sort by name for markdup, then re-sort by coordinate.
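For anyone following along, the extra sorting looks roughly like this with plain samtools. This is only a sketch of the generic samtools workflow, not how Snippy implements it: the function name `markdup_commands` and the temporary file names are made up for illustration, and the commands are built as argument lists rather than run.

```python
def markdup_commands(bam_in, bam_out, prefix="tmp", remove=False):
    """Return the samtools steps for duplicate marking as argument lists.

    Hypothetical helper: file names derived from `prefix` are placeholders.
    """
    name_sorted = f"{prefix}.namesort.bam"
    fixmate = f"{prefix}.fixmate.bam"
    coord_sorted = f"{prefix}.possort.bam"

    markdup = ["samtools", "markdup"]
    if remove:
        # -r removes duplicates instead of only flagging them
        markdup.append("-r")
    markdup += [coord_sorted, bam_out]

    return [
        # 1. sort by read name so mates are adjacent for fixmate
        ["samtools", "sort", "-n", "-o", name_sorted, bam_in],
        # 2. add the mate score tags that markdup relies on
        ["samtools", "fixmate", "-m", name_sorted, fixmate],
        # 3. sort back to coordinate order, which markdup expects
        ["samtools", "sort", "-o", coord_sorted, fixmate],
        # 4. mark (or remove) the duplicates
        markdup,
    ]


if __name__ == "__main__":
    for cmd in markdup_commands("snps.bam", "snps.markdup.bam"):
        print(" ".join(cmd))
```

The two `samtools sort` calls are where the slowdown comes from.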
To be honest this is my first bacterial genomics project, so I don't have good insight into how often this is needed/how much of a difference it typically makes. I made the suggestion because it was a QC step that I saw frequently in mapping tutorials, and it was recommended to me by a collaborating group that analyzes microbial genomic data for clinical purposes.
My current project has about 3% duplicates according to Picard MarkDuplicates, but I only used 5 cycles of PCR in my library prep. I'm guessing it becomes more of an issue the more PCR that is used.
@amilesj I have added markdup support in the snippy_4.0 branch, which will be released soon.
I was surprised in the old days, when we only had single-end reads, that people used to mark duplicates, because if your coverage was > 100x and you used 100bp reads, you would expect to get genuine duplicates.
But for paired-end data this is statistically less likely unless you have coverages >>10000x.
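A back-of-the-envelope way to see this, as a rough sketch only: assume fragments start uniformly at random along the genome, so the number landing on the same (position, strand) follows a Poisson distribution, and for pairs assume the insert size is spread evenly over some number of values. The function name and the default of 200 insert sizes in the usage note are made-up assumptions for illustration, not measured values.

```python
import math


def chance_duplicate_fraction(coverage, read_length, paired=False, insert_sizes=1):
    """Expected fraction of reads (or pairs) that collide by chance alone.

    Simplified model: uniform random start positions, Poisson counts per slot.
    `insert_sizes` is a hypothetical count of equally likely insert sizes and
    is only used for paired-end data.
    """
    reads_per_fragment = 2 if paired else 1
    # fragments starting at each reference position, on average
    fragments_per_position = coverage / (read_length * reads_per_fragment)
    # a duplicate must share strand, and for pairs also the insert size
    slots_per_position = 2 * (insert_sizes if paired else 1)
    lam = fragments_per_position / slots_per_position
    if lam == 0:
        return 0.0
    # occupied slots ~ (1 - exp(-lam)) per slot, reads ~ lam per slot;
    # everything beyond the first read in a slot counts as a duplicate
    return 1.0 + math.expm1(-lam) / lam


if __name__ == "__main__":
    print(chance_duplicate_fraction(100, 100))
    print(chance_duplicate_fraction(100, 100, paired=True, insert_sizes=200))
```

Under these assumptions, 100x of single-end 100bp reads gives roughly 21% chance duplicates, while the same coverage as pairs with 200 possible insert sizes gives well under 0.1%.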
Now that I've added it, I see results similar to what you describe: about 1%-3% duplicates.
Keep an eye out for Snippy 4.0 soon.
Awesome, thanks!
It would be great if users had an option in Snippy (something like "--dedup") that would mark duplicate reads for removal via the newly available samtools markdup (added in Samtools release 1.6).