Closed Dani-kolbe closed 1 year ago
Hi @Dani-kolbe
To answer this, we need to talk through how DeepVariant handles no-calls and how GLnexus then acts on that information.
There are two different ways that DeepVariant can make a no-call/reference call for a position. DeepVariant first looks through the genome for any positions with some evidence that a variant may exist (for example any position where 2 reads support a variant and more than 12% allele frequency). The neural net assigns a genotype probability for all of these positions. The confidence that the neural network has in the variant call is represented in the GQ field (which is the phred-encoded probability that the assigned genotype is correct). When the probability for reference is >50% and 99% (GQ20), the genotype assigned is ./. when the probability for reference is >99% (GQ20+) the genotype assigned is 0/0.
In your gVCF calls, only this row is a REF call made by the neural net:
1 69897 . T C,<*> 0.7 RefCall . GT:GQ:DP:AD:VAF:PL ./.:8:6:2,4,0:0.666667,0:0,8,13,990,990,990
The second way that DeepVariant makes a no-call/reference-call is in the process of gVCF generation. The gVCF was designed as a way to encode the probability of a region with no variant call information is a reference, so that later in joint genotyping of many samples, you can potentially change the genotype call based on what you see in the population.
In your gVCF calls all other REF calls fall into this category.
DeepVariant makes a gVCF with REF-CALL blocks over stretches where it observes no candidates, with a heuristic logic based on coverage and support determining the probability of these calls. In calls of this nature, there is a different logic that does not apply the GQ20 threshold. In joint genotyping, GLnexus will use the probabilities to determine the VCF call.
In summary, a variant (0/1 or 1/1) will have a variant call if it is the most likely genotype at the position. A position will receive a 0/0 call if the model observed a site with >GQ20
It can make sense to filter DeepVariant results if you have a substantial preference for precision. In addition to higher overall accuracy, we try to make DeepVariant report well-calibrated confidence probabilities. If you need higher precision for the variant calls, filtering on the call GQ is the best field.
This was a long answer. Please let me know what areas remain unclear after reading it.
Thank you, Andrew
Hi Andrew, Thank you very much for your response, this has definitely helped clarify the situation!
Best wishes, Daniel
Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes
Describe the issue: I do not understand how Deepvariant decides if there is a call at sites with low/zero GQ/DP... below is an example of a merged VCF with GLNexus and the same sites in the individual gVCFs.
Setup
Steps to reproduce:
DeepVariant:
GLNexus:
Merged VCF with GLNexus:
1 69897 1_69897_T_C T C 48 . AF=0.076465;AQ=48 GT:DP:AD:GQ:PL:RNC 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,3,29:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,0,0:.. 1/1:3:0,3:15:31,19,0:.. 0/0:0:0,0:1:0,3,29:.. ./.:6:2,4:8:0,8,13:II ./.:1:1,0:0:29,3,0:II 1/1:2:0,2:7:13,11,0:.. ./.:1:1,0:0:29,3,0:II
Same site/samples gVCF (each line is one sample in same order as merged file above):
Seems as long as GQ is >0 there is a call? That seems very low. I do not want to start manually filterting sites as this defeats the purpose of using DeepVariant, but some clarification would be much appreciated. Thank you!