google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Difference between WES/WGS error models #329

Closed: anands-repo closed this issue 4 years ago

anands-repo commented 4 years ago

What is the difference between the WGS and WES error models? Is it simply the way coverage varies, or do the error rates and error types differ as well?

Also, for the open training data, I noticed that the BAM files are named *deduplicated.bam. What method was used to mark duplicates? Is it GATK?

AndrewCarroll commented 4 years ago

Hi @anands-repo

The error profiles of WGS and WES differ in many ways, including capture kit efficiency, additional errors from PCR during preparation, differences in the amount of on-target reads, greater coverage variability in exomes, and probably many other factors that are not completely understood. This paper is probably a good place to start on some of the factors that differ between the assays: https://www.pnas.org/content/112/17/5473

For deduplication, we use Picard MarkDuplicates as run by GATK. We observe only negligible differences in variant call quality with and without MarkDuplicates, and these become observable only at lower coverages (15x-22x). This is one reason we list MarkDuplicates as an optional step in our BestPractices.
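
For anyone wanting to reproduce a similar deduplication step, here is a minimal sketch of how a MarkDuplicates run might be assembled from Python. It assumes GATK4 is installed and available on the PATH as `gatk`. The file names and the `mark_duplicates_command` helper are hypothetical, and the exact options used for the training data are not stated in this thread.

```python
import shlex


def mark_duplicates_command(input_bam, output_bam, metrics_file):
    """Build the argument list for a GATK4 MarkDuplicates run.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    return [
        "gatk", "MarkDuplicates",
        "-I", input_bam,     # input alignments (BAM)
        "-O", output_bam,    # output BAM with duplicate reads flagged
        "-M", metrics_file,  # text file receiving duplication metrics
    ]


# Hypothetical file names, chosen to mirror the *deduplicated.bam naming.
cmd = mark_duplicates_command(
    "sample.bam",
    "sample.deduplicated.bam",
    "sample.dup_metrics.txt",
)
print(shlex.join(cmd))
```

To actually run the step, the list could be passed to `subprocess.run(cmd, check=True)`. By default MarkDuplicates flags duplicate reads rather than removing them.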

anands-repo commented 4 years ago

Thanks for the material and the answers!