google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.23k stars 725 forks source link

What does exactly "PASS" mean? #278

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello, I am curious about the meaning of "PASS" actually, how exactly is this evaluated? I noticed that filtering on "PASS" I am missing some variants. For examples, some variants with a GQ of 50 have a "." in the QUAL column. Example, it's a multi sample experiment chromosome_6 9454294 . G <*> 0 . END=9454294 GT:GQ:MIN_DP:PL 0/0:50:296:0,240,2879 And in another sample

chromosome_6 9454294 . G C,<*> 21.7 PASS . GT:GQ:DP:AD:VAF:PL 0/1:22:176:96,79,0:0.448864,0:21,0,99,990,990,990

What kind of info does DeepVariant use to decide if a variant pass or not?

Thank you!

AndrewCarroll commented 4 years ago

Hi @aderzelle

PASS is a commonly used value for the Filter field in VCFs. We use this to align with the terminology frequently used for variant calls. When an entry has PASS, it means that a candidate was generated and the neural network classifier gave the probability of the non-reference genotype call as higher than for reference. When an entry has RefCall present, it means that a candidate was generated and the neural network classifier gave a higher probability for a reference call. When there is no entry, this should correspond to gVCF blocks where no candidate was generated, so the neural network did not classify the position (the call is reference but made in a different way).

So PASS or not pass relates to the main output of DeepVariant.

ghost commented 4 years ago

That makes a lot of sense, thank you. So RefCall is not equivalent to the absence of entry. In case you would find this interesting, in the following example, the third T is set as RefCall in one vcf, and as a A to T SNP in another one


chromosome_1    10764356    .   A   T,<*>   0   RefCall .   GT:GQ:DP:AD:VAF:PL  0/0:36:290:223,64,0:0.22069,0:0,35,56,990,990,990

chromosome_1    10764356    .   A   T,<*>   26.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:161:98,63,0:0.391304,0:26,0,64,990,990,990

This is a bit mysterious to me why it does so; except that one of my sample has a very high coverage (> 300) and maybe this confuses DeepVariant regarding the allele ratio, maybe it thinks that 64/223 is not high enough. This region seems quite complexe in fact.

example

ghost commented 4 years ago

Mmmm actually it seems Deepvariant behaves unexpectedly in the neighbour of multi nucleotide variants. For example


zgrep -w "chromosome_1" output.g.vcf.gz|grep -C 3 "10764356"
chromosome_1    10764353    .   C   T,<*>   0   RefCall .   GT:GQ:DP:AD:VAF:PL  0/0:36:219:152,66,0:0.30137,0:0,35,50,990,990,990
chromosome_1    10764354    .   A   G,<*>   22.7    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:23:287:217,66,0:0.229965,0:22,0,52,990,990,990
chromosome_1    10764355    .   C   T,<*>   21.8    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:22:291:153,133,0:0.457045,0:21,0,52,990,990,990
chromosome_1    10764356    .   A   T,<*>   0   RefCall .   GT:GQ:DP:AD:VAF:PL  0/0:36:290:223,64,0:0.22069,0:0,35,56,990,990,990
chromosome_1    10764357    .   T   <*> 0   .   END=10764357    GT:GQ:MIN_DP:PL 0/0:50:292:0,300,2999
chromosome_1    10764358    .   T   TTC,<*> 30.4    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:30:300:156,69,0:0.23,0:30,0,65,990,990,990
chromosome_1    10764359    .   T   <*> 0   .   END=10764381    GT:GQ:MIN_DP:PL 0/0:50:301:0,300,2999

In 2 other samples for the same site

 zgrep -w "chromosome_1" H4A4.g.vcf.gz|grep -C 3 "10764356" 
chromosome_1    10764351    .   GAC G,<*>   30.5    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:31:41:16,25,0:0.609756,0:30,0,53,990,990,990
chromosome_1    10764354    .   A   G,<*>   30.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:30:40:15,25,0:0.625,0:30,0,51,990,990,990
chromosome_1    10764355    .   C   T,<*>   28.6    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:29:41:15,26,0:0.634146,0:28,0,53,990,990,990
chromosome_1    10764356    .   A   <*> 0   .   END=10764357    GT:GQ:MIN_DP:PL 0/0:50:40:0,120,1199
chromosome_1    10764358    .   T   TTC,<*> 31.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:31:41:14,27,0:0.658537,0:31,0,60,990,990,990
chromosome_1    10764359    .   T   <*> 0   .   END=10764381    GT:GQ:MIN_DP:PL 0/0:50:39:0,123,1229
chromosome_1    10764382    .   T   G,<*>   38.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:36:40:25,15,0:0.375,0:38,0,40,990,990,990

zgrep -w "chromosome_1" controlH.g.vcf.gz|grep -C 3 "10764356" 

chromosome_1    10764353    .   C   T,<*>   27.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:27:162:100,62,0:0.382716,0:27,0,51,990,990,990
chromosome_1    10764354    .   A   <*> 0   .   END=10764354    GT:GQ:MIN_DP:PL 0/0:50:162:0,300,2999
chromosome_1    10764355    .   C   T,<*>   26.4    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:162:100,62,0:0.382716,0:26,0,60,990,990,990
chromosome_1    10764356    .   A   T,<*>   26.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:161:98,63,0:0.391304,0:26,0,64,990,990,990
chromosome_1    10764357    .   T   <*> 0   .   END=10764357    GT:GQ:MIN_DP:PL 0/0:50:162:0,300,2999
chromosome_1    10764358    .   T   C,<*>   33.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:33:172:98,63,0:0.366279,0:33,0,69,990,990,990
chromosome_1    10764359    .   T   <*> 0   .   END=10764381    GT:GQ:MIN_DP:PL 0/0:50:169:0,300,2999

To me in the 3 samples it's the same site that is present.

Here another one (where the SNP is RefCall in one sample and PASS in another)

zgrep -w "chromosome_2" output.g.vcf.gz|grep -C 2 "9780248"
chromosome_2    9780195 .   T   <*> 0   .   END=9780244 GT:GQ:MIN_DP:PL 0/0:50:294:0,300,2999
chromosome_2    9780245 .   GGT G,<*>   37.4    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:36:294:161,131,0:0.445578,0:37,0,41,990,990,990
chromosome_2    9780248 .   A   <*> 0   .   END=9780249 GT:GQ:MIN_DP:PL 0/0:50:163:0,270,2939
chromosome_2    9780250 .   T   TTG,<*> 36.8    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:32:298:161,133,0:0.446309,0:36,0,34,990,990,990
chromosome_2    9780251 .   A   <*> 0   .   END=9780281 GT:GQ:MIN_DP:PL 0/0:50:272:0,300,2999

In this second case, I don't understand why DeepVariant did not even consider there might be a variant there, as the bam clearly shows many reads mapping in that position with a variant site (it's the middle T flanked by 2 homozygous T sites).

example2

Now I know that calling SNP (in the sense of single nucleotide) variation in the vicinity of more complex events is known to be tricky, therefore this might not be an issue with DeepVariant.

ghost commented 4 years ago

Sorry for the string of messages, but a bit of background: my samples controlH is a direct descendant of the "output" one, it is a clonal organism so I am assuming all variants private to the controlH are false calls. By investigating them visually (eyeballing the bam file) it seems they are indeed false positives. What is interesting is that they all seem to be in the direct vicinity of an indel or string of mismatches. Is it a known problem with DeepVariant, miscalls/ inconsistencies between samples, for sites around indels?

gunjanbaid commented 4 years ago

Hi @aderzelle, in the case of position 10764356, the allele fraction is a bit lower in the case of the RefCall. A few other implementation details that might cause the differences between samples: DeepVariant randomly samples the reads as the pileup images generated can accommodate at most 100 reads. In the case of high coverage regions, the observed allele fraction can change as a result of this downsampling. Sampling for a particular sample is deterministic, but may happen differently across samples. Another source of difference between the three samples might be caused the realigner. DeepVariant runs a realignment that can be turned off by adding the flag --norealign_reads to the make_examples step. Turning the realigner off entirely will likely hurt overall accuracy, but for this example, it might be useful to see if that's affecting the results.

Regions with many nearby variants do end up being challenging for the neural network to correctly classify. However, in the case of position 9780248, it is surprising that a candidate was not generated. Candidate generation should not be affected by the nearby variants.

gunjanbaid commented 4 years ago

@aderzelle I'll go ahead and close this issue, but feel free to reopen if you still have pending questions. Thanks!