tseemann / snippy

:scissors: :zap: Rapid haploid variant calling and core genome alignment
GNU General Public License v2.0

Add Samtools markdup to Snippy #124

Closed: amilesj closed this issue 6 years ago

amilesj commented 6 years ago

It would be great if Snippy had an option (something like "--dedup") that would mark duplicate reads for removal via the newly available Samtools markdup (in Samtools Release 1.6).

tseemann commented 6 years ago

This is a good idea, and is planned.

Do you actually get a lot of duplicates in bacterial genome sequencing? We don't.

The only issue is that it will slow Snippy down, as the reads need to be sorted by name for markdup, then re-sorted by coordinate.
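To make the extra sorting cost concrete, here is a minimal Python sketch of the usual samtools duplicate-marking workflow: name sort, `fixmate -m`, coordinate sort, then `markdup`. This is not taken from Snippy's code. The function name `markdup_commands`, the intermediate file names, and the `remove` parameter are hypothetical and only for illustration. The samtools subcommands and flags are the standard ones from Samtools 1.6 onwards.

```python
def markdup_commands(bam_in, bam_out, threads=1, remove=True):
    """Build (but do not run) the samtools commands for duplicate marking.

    Hypothetical helper: the intermediate file names are made up here.
    """
    t = str(threads)
    name_sorted = bam_out + ".namesort.bam"
    fixed = bam_out + ".fixmate.bam"
    coord_sorted = bam_out + ".coordsort.bam"

    # markdup marks duplicates by default; -r removes them instead.
    markdup = ["samtools", "markdup"]
    if remove:
        markdup.append("-r")
    markdup += [coord_sorted, bam_out]

    return [
        # 1. Sort by read name so that mates are adjacent.
        ["samtools", "sort", "-n", "-@", t, "-o", name_sorted, bam_in],
        # 2. Add the mate score tags that markdup relies on.
        ["samtools", "fixmate", "-m", name_sorted, fixed],
        # 3. Sort back into coordinate order.
        ["samtools", "sort", "-@", t, "-o", coord_sorted, fixed],
        # 4. Mark (or remove) duplicates on the coordinate-sorted file.
        markdup,
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order. The two `sort` steps are where the slowdown comes from.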

amilesj commented 6 years ago

To be honest this is my first bacterial genomics project, so I don't have good insight into how often this is needed/how much of a difference it typically makes. I made the suggestion because it was a QC step that I saw frequently in mapping tutorials, and it was recommended to me by a collaborating group that analyzes microbial genomic data for clinical purposes.

My current project has about 3% duplicates according to Picard MarkDuplicates, but I only used 5 cycles of PCR in my library prep. I'm guessing it becomes more of an issue the more PCR that is used.

tseemann commented 6 years ago

@amilesj I have added markdup support in the snippy_4.0 branch, which will be released soon.

In the old days, when we only had single-end reads, I was surprised that people marked duplicates, because if your coverage was >100x and you used 100 bp reads, you would expect to get genuine duplicates.

But for paired-end data this is statistically less likely unless you have coverages >>10000x.
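The single-end versus paired-end argument can be checked with a rough calculation. The sketch below assumes reads start uniformly at random along the genome, so the number of reads per distinct "slot" is Poisson distributed. It also assumes a duplicate is defined by start position and strand for single-end reads, and by start position, strand, and fragment length for pairs. The function name and the figure of 200 equally likely fragment lengths are illustrative assumptions, not values from this thread.

```python
import math

def chance_duplicate_fraction(coverage, read_len, paired=False, insert_sizes=1):
    """Expected fraction of reads (or pairs) that duplicate another by chance.

    Simplified Poisson model with uniform random starts (an assumption).
    insert_sizes is the assumed number of equally likely fragment lengths.
    """
    if paired:
        # Each pair contributes 2 * read_len bases of coverage.
        units_per_base = coverage / (2 * read_len)
        slots_per_base = 2 * insert_sizes  # strand x fragment length
    else:
        units_per_base = coverage / read_len
        slots_per_base = 2  # strand only
    lam = units_per_base / slots_per_base  # mean reads per slot
    # Occupied slots per slot is 1 - exp(-lam); everything beyond the
    # first read in a slot counts as a duplicate.
    return 1 - (-math.expm1(-lam)) / lam

single = chance_duplicate_fraction(100, 100)
paired_100x = chance_duplicate_fraction(100, 100, paired=True, insert_sizes=200)
paired_10000x = chance_duplicate_fraction(10000, 100, paired=True, insert_sizes=200)
```

Under these assumptions, single-end 100 bp reads at 100x give roughly 21% chance duplicates. Paired-end reads at 100x give well under 0.1%, and the figure only reaches a few percent at around 10000x.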

Now that I've added it, I see something similar to what you report: about 1%-3% duplicates.

Keep an eye out for Snippy 4.0 soon.

amilesj commented 6 years ago

Awesome, thanks!