google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.12k stars 703 forks source link

How does DeepVariant decide on Call or No-Call? #639

Closed Dani-kolbe closed 1 year ago

Dani-kolbe commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes

Describe the issue: I do not understand how Deepvariant decides if there is a call at sites with low/zero GQ/DP... below is an example of a merged VCF with GLNexus and the same sites in the individual gVCFs.

Setup

Steps to reproduce:

/opt/deepvariant/bin/run_deepvariant \
    --model_type WES \
    --ref ${ref} \
    --reads ${cram_in} \
    --regions ${regions} \
    --output_gvcf ${sample}.g.vcf.gz \
    --output_vcf ${sample}.vcf.gz \
    --num_shards 8 \

GLNexus:

glnexus_cli --config DeepVariantWES --bed /work_beegfs/***/exomes/hg38_exomregions_withoutCHR.bed /work_beegfs/***/exomes/2_gvcf/*.g.vcf.gz > /work_beegfs/***/exomes/3_bcf/Exomes.bcf
bcftools view ../3_bcf/Exomes.bcf | bgzip -@ 4 -c > Exomes.vcf.gz
bcftools index Exomes.vcf.gz

Merged VCF with GLNexus:

1 69897 1_69897_T_C T C 48 . AF=0.076465;AQ=48 GT:DP:AD:GQ:PL:RNC 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,3,29:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,0,0:.. 1/1:3:0,3:15:31,19,0:.. 0/0:0:0,0:1:0,3,29:.. ./.:6:2,4:8:0,8,13:II ./.:1:1,0:0:29,3,0:II 1/1:2:0,2:7:13,11,0:.. ./.:1:1,0:0:29,3,0:II

Same site/samples gVCF (each line is one sample in same order as merged file above):


1   69792   .   T   <*> 0   .   END=70036   GT:GQ:MIN_DP:PL 0/0:1:0:0,3,29
1   69638   .   C   <*> 0   .   END=70036   GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
1   69850   .   C   <*> 0   .   END=69929   GT:GQ:MIN_DP:PL 0/0:1:0:0,3,29
1   69897   .   T   <*> 0   .   END=69897   GT:GQ:MIN_DP:PL ./.:0:1:29,3,0
1   69683   .   T   <*> 0   .   END=69911   GT:GQ:MIN_DP:PL 0/0:1:0:0,3,29
1   69037   .   G   <*> 0   .   END=70036   GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
1   69897   .   T   C,<*>   31.7    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:19:3:0,3,0:1,0:31,19,0,990,990,990
1   69848   .   G   <*> 0   .   END=69920   GT:GQ:MIN_DP:PL 0/0:1:0:0,3,29
1   69897   .   T   C,<*>   0.7 RefCall .   GT:GQ:DP:AD:VAF:PL  ./.:8:6:2,4,0:0.666667,0:0,8,13,990,990,990
1   69897   .   T   <*> 0   .   END=69897   GT:GQ:MIN_DP:PL ./.:0:1:29,3,0
1   69897   .   T   C,<*>   14.3    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:10:2:0,2,0:1,0:13,11,0,990,990,990
1   69897   .   T   <*> 0   .   END=69897   GT:GQ:MIN_DP:PL ./.:0:1:29,3,0

Seems as long as GQ is >0 there is a call? That seems very low. I do not want to start manually filterting sites as this defeats the purpose of using DeepVariant, but some clarification would be much appreciated. Thank you!

AndrewCarroll commented 1 year ago

Hi @Dani-kolbe

To answer this, we need to talk through how DeepVariant handles no-calls and how GLnexus then acts on that information.

There are two different ways that DeepVariant can make a no-call/reference call for a position. DeepVariant first looks through the genome for any positions with some evidence that a variant may exist (for example any position where 2 reads support a variant and more than 12% allele frequency). The neural net assigns a genotype probability for all of these positions. The confidence that the neural network has in the variant call is represented in the GQ field (which is the phred-encoded probability that the assigned genotype is correct). When the probability for reference is >50% and 99% (GQ20), the genotype assigned is ./. when the probability for reference is >99% (GQ20+) the genotype assigned is 0/0.

In your gVCF calls, only this row is a REF call made by the neural net:

1   69897   .   T   C,<*>   0.7 RefCall .   GT:GQ:DP:AD:VAF:PL  ./.:8:6:2,4,0:0.666667,0:0,8,13,990,990,990

The second way that DeepVariant makes a no-call/reference-call is in the process of gVCF generation. The gVCF was designed as a way to encode the probability of a region with no variant call information is a reference, so that later in joint genotyping of many samples, you can potentially change the genotype call based on what you see in the population.

In your gVCF calls all other REF calls fall into this category.

DeepVariant makes a gVCF with REF-CALL blocks over stretches where it observes no candidates, with a heuristic logic based on coverage and support determining the probability of these calls. In calls of this nature, there is a different logic that does not apply the GQ20 threshold. In joint genotyping, GLnexus will use the probabilities to determine the VCF call.

In summary, a variant (0/1 or 1/1) will have a variant call if it is the most likely genotype at the position. A position will receive a 0/0 call if the model observed a site with >GQ20

It can make sense to filter DeepVariant results if you have a substantial preference for precision. In addition to higher overall accuracy, we try to make DeepVariant report well-calibrated confidence probabilities. If you need higher precision for the variant calls, filtering on the call GQ is the best field.

This was a long answer. Please let me know what areas remain unclear after reading it.

Thank you, Andrew

Dani-kolbe commented 1 year ago

Hi Andrew, Thank you very much for your response, this has definitely helped clarify the situation!

Best wishes, Daniel