google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Deepvariant genotype #838

Closed MKaandemir closed 2 months ago

MKaandemir commented 3 months ago

Hi,

Thanks for the great tool. I got the following variant lines and I am not sure how to handle them, since they are germline calls. The input BAM consists only of disease-causing tandem repeat regions. Would I get the same lines if I ran a whole-genome BAM? If so, how should I handle these cases?

chr8    118316369   .   CA  C,CAAAAAAAAAAA,CAAAAAAAAAAAA    34.6    PASS    .   GT:GQ:DP:AD:VAF:PL:PS   2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886
chr8    118620399   .   CAAAA   C,CA,CAAA   33.3    PASS    .   GT:GQ:DP:AD:VAF:PL:PS   3|1:4:25:2,8,3,6:0.32,0.12,0.24:30,8,43,3,43,43,8,0,10,41:118384801
MKaandemir commented 3 months ago

I looked at it again. The first line should be like this, right?

chr8 118316369 . CA C
chr8 118316370 . A AAAAAAAAAAA,AAAAAAAAAAAA
kishwarshafin commented 3 months ago

Hi @MKaandemir,

chr8    118316369   .   CA  C,CAAAAAAAAAAA,CAAAAAAAAAAAA    34.6    PASS    .   GT:GQ:DP:AD:VAF:PL:PS   2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886

In this case, the genotype is 2|1. You can interpret this as:

allele0=ref (CA)
allele1=alt1 (CA->C)
allele2=alt2 (CA->CAAAAAAAAAAA)
allele3=alt3 (CA->CAAAAAAAAAAAA)

As the genotype is 2|1, you can read it as alt2|alt1. So the first haplotype sees:

CA->CAAAAAAAAAAA

And second haplotype sees:

CA->C

So haplotype 1 has a 10 bp insertion of As and haplotype 2 has a 1 bp deletion of A. You can represent this in many ways. However, if you left-shift, it becomes:

chr8 118316369 . CA C,CAAAAAAAAAAA

This is equivalent to what you had in the VCF. You can use bcftools norm or a similar tool if you want to normalize the variant call. Any representation that yields the correct underlying haplotypes is a valid one, unless you are looking for something specific.
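The interpretation above can be reproduced with a few lines of Python. This is only a minimal sketch: the REF, ALT and GT strings are copied from the record quoted in this thread, while the helper name `describe` is made up for illustration and is not part of DeepVariant or bcftools. It assumes a called diploid genotype (no missing `.` alleles).

```python
# Fields copied from the chr8:118316369 record quoted in this thread.
ref = "CA"
alts = "C,CAAAAAAAAAAA,CAAAAAAAAAAAA".split(",")
gt = "2|1"

# Allele index 0 is REF; indices 1..n are the ALT alleles in order.
alleles = [ref] + alts


def describe(ref_allele: str, allele: str) -> str:
    """Classify an allele by its length relative to REF (hypothetical helper)."""
    delta = len(allele) - len(ref_allele)
    if delta > 0:
        return f"{delta} bp insertion"
    if delta < 0:
        return f"{-delta} bp deletion"
    return "reference" if allele == ref_allele else "substitution"


# '|' marks a phased genotype, '/' an unphased one.
phased = "|" in gt
indices = [int(i) for i in gt.replace("|", "/").split("/")]
haplotypes = [alleles[i] for i in indices]

for n, (idx, allele) in enumerate(zip(indices, haplotypes), start=1):
    print(f"haplotype {n}: allele {idx} ({ref} -> {allele}), {describe(ref, allele)}")
```

The third ALT allele is never selected by the genotype, which is why it can be dropped when the record is rewritten with only the alleles that are actually called.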

MKaandemir commented 3 months ago

Thanks for the explanation! I'm also curious about why there are four allelic depths. Is the first one the reference allele's depth? In biallelic SNPs, it doesn't show the reference allele. Also, why does the genotype use only two alleles while the ALT column lists three?

kishwarshafin commented 3 months ago

Hi @MKaandemir, from the header you can see the description of each field:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">

Yes, the first value for AD is for the reference allele.
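The `Number=` entries in the header explain how many values each field carries: `R` means one value per allele including REF, `A` means one per ALT allele, and `G` means one per possible genotype. The sketch below checks this against the sample column of the record quoted earlier in this thread. The diploid PL ordering used in `pl_index` follows the VCF specification; the function name itself is just an illustrative choice.

```python
# FORMAT and sample columns copied from the chr8:118316369 record in this thread.
fmt = "GT:GQ:DP:AD:VAF:PL:PS".split(":")
sample = "2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886".split(":")
fields = dict(zip(fmt, sample))

n_alt = 3                 # ALT column lists three alleles
n_alleles = n_alt + 1     # plus REF

ad = [int(x) for x in fields["AD"].split(",")]      # Number=R
vaf = [float(x) for x in fields["VAF"].split(",")]  # Number=A
pl = [int(x) for x in fields["PL"].split(",")]      # Number=G

# Number of unordered diploid genotypes over n_alleles alleles.
n_genotypes = n_alleles * (n_alleles + 1) // 2


def pl_index(a: int, b: int) -> int:
    """Position of diploid genotype a/b in a Number=G field (VCF spec ordering)."""
    j, k = sorted((a, b))
    return k * (k + 1) // 2 + j


called = [int(i) for i in fields["GT"].replace("|", "/").split("/")]
idx = pl_index(*called)

print(f"AD has {len(ad)} values, first is REF depth: {ad[0]}")
print(f"VAF has {len(vaf)} values, PL has {len(pl)} values")
print(f"PL entry for called genotype {fields['GT']}: index {idx}, value {pl[idx]}")
```

For this record the PL entry of the called genotype is the one with value 0, that is, the most likely genotype, which is what you would expect for the genotype reported in GT.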

MKaandemir commented 3 months ago

In the example line, the depth is listed as 36. However, the allelic depths are 5, 9, 7, and 5. The sum of these allelic depths does not equal the value in the DP field.

chr8    118316369   .   CA  C,CAAAAAAAAAAA,CAAAAAAAAAAAA    34.6    PASS    .   GT:GQ:DP:AD:VAF:PL:PS   2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886
kishwarshafin commented 3 months ago

@MKaandemir that means there were more alleles at this position with lower frequency. They were dropped during candidate generation because they did not meet all the heuristics required for an allele to become a candidate. You can read the DeepVariant manuscript to understand the process fully.
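The gap between DP and the sum of AD can be computed directly from the two records quoted at the top of this thread. The sketch below is an observation on those two lines only: it checks how many reads are not assigned to any listed allele, and whether the reported VAF values agree with AD divided by DP. It does not claim to reproduce how DeepVariant computes these fields internally, and `summarize` is a name invented for the example.

```python
import math

# (DP, AD, VAF) copied from the two records quoted in this thread.
records = {
    "chr8:118316369": (36, [5, 9, 7, 5], [0.25, 0.194444, 0.138889]),
    "chr8:118620399": (25, [2, 8, 3, 6], [0.32, 0.12, 0.24]),
}


def summarize(dp, ad, vaf):
    """Return (reads not covered by AD, whether VAF matches AD_alt / DP)."""
    unassigned = dp - sum(ad)
    vaf_matches = all(
        math.isclose(alt_depth / dp, reported, abs_tol=1e-6)
        for alt_depth, reported in zip(ad[1:], vaf)
    )
    return unassigned, vaf_matches


summary = {site: summarize(*values) for site, values in records.items()}

for site, (unassigned, vaf_matches) in summary.items():
    print(f"{site}: {unassigned} reads outside listed alleles, "
          f"VAF == AD/DP: {vaf_matches}")
```

In both records the VAF values are consistent with each ALT depth divided by DP rather than by the sum of AD, so the reads that supported dropped alleles still count toward the denominator.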

kishwarshafin commented 2 months ago

Closing this issue. Please feel free to reopen if you have further questions.