google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.17k stars 716 forks source link

Why does deepvariant report an allele if the allele does not exist in the sample? #618

Closed fardokhtsadat closed 1 year ago

fardokhtsadat commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes

Describe the issue: Hello,

Using WES model, deepvariant calls the following variant in the vcf file:

NC_000001.11    84574341    .   CAGCAGCGCT  C,T .   .   .   GT:GQ:DP:AD:VAF:PL  1/0:3:97:25,45,26:0.463918,0.268041:36,0,47,0,16,44

For this variant, the genotype is 1/0, meaning that one allele is REF, and the other allele is C. What is confusing is that deepvariant also calls a T however this is not referenced anywhere in the GT field.

What is the point of calling T if it does not occur in the sample?

Here is the screenshot of the original alignment:

dv1

And here is the screenshot for the realigned reads for this position:

dv2

Setup

Steps to reproduce:

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md. Is there any way to reproduce the issue by using the quick start?

Any additional context:

AndrewCarroll commented 1 year ago

Hi @fardokhtsadat

DeepVariant works in stages - make_examples and call_variants. Roughly, make_examples uses a set of heuristics similar to those used in other non-ML methods to realign reads and generate a list of potential positions to evaluate by the neural network. In the call_variants stage, a neural network assesses the probability that a candidate is not a variant, heterozygous, or homozygous.

In this variant case, after realignment occurred, two candidates were considered, a deletion CAGCAGCGCT -> C, a SNP C -> T at the same position, and reference. The reporting of allele information comes after realignment. Your realignment pileup shows that a T is present in reads in that position. To understand why in realignment these are chosen, we might have to get a snippet of your BAM file and look at the actual De Bruijn Graph to understand why the realignment scoring has this as preferred (and if any of those bases are from rescued soft-clip bases which can occur.

The neural network decided that the T SNP was spurious and the deletion the more likely outcome.

However, since all candidates are reported (even if a variant call is not made) and because this report in the VCF reports the supporting information observed after re-alignment, the VCF contains information for a rejected T call (as the genotype is 0/1 not 0/2) and the number of reads where a re-aligned T was observed.

The reason all potentially seen sites are reported is to allow filtering for higher sensitivity by users, if desired, and for joint calling where rejected sites across several samples could be considered as evidence to reverse a no-call decision. It is also convenient for debugging, as it indicates whether a call was not made because the neural network decided a reference call or because a candidate was not generated at that position.

fardokhtsadat commented 1 year ago

@AndrewCarroll Thank you for the information.

In the description above, you mentioned that deepvariant sees two candidates:

However, I thought deepvariant would detect the following as the two candidates (Applying vt decompose produces the same as the following):

Am I misunderstanding this?

Going back to the GT field, can it be that this behavior is changed in version 1.5.0? Looking results of deepvariant version 1.2.0, the GT field for all multi-allelic variants only supports the first alternate allele. I tried running deepvariant version 1.5.0 on the same alignment file, and from 47 multi-allelic variants, almost all have a GT field that supports both of the alternate alleles. Here is an example:

Here is an example: Result of deepvariant version 1.2.0:

NC_000001.11    6545786 .   C   A,T .   .   .   GT:GQ:DP:AD:VAF:PL  1/0:40:91:0,54,37:0.593407,0.406593:50,44,53,44,0,61

Result of deepvariant version 1.5.0:

NC_000001.11    6545786 .   C   A,T 57  PASS    .   GT:GQ:DP:AD:VAF:PL  1/2:48:91:0,54,37:0.593407,0.406593:57,52,61,52,0,66

However, from 47 multi-allelic variants, 4 have a GT field presented as 0/1. Is it safe to further process the VCF file to only keep the first allele for these 4 variants since deepvariant did not call them?

AndrewCarroll commented 1 year ago

Hi @fardokhtsadat

I realize in looking at the VCF entry, I was mistaken in saying DeepVariant sees a SNP and an Indel. In fact, it sees the candidates as you wrote.

CAGCAGCGCT -> C CAGCAGCGCT -> T

That's entirely on me, it should be clear from the VCF, and I don't have an excuse to make such a basic mistake. The call looks correct, it does look like a HET deletion of AGCAGCGCT. It looks like the realigner has decided to place a gap at this position and left-align the trailing T at the edge of the deletion so that when it is present it looks like a SNP. It's hard to authoritatively say why the realigner is going something without getting the actual BAM snippet and looking at the realignment. That part is in the heuristics of DeepVariant not the neural net. As a result of the re-alignment, DeepVariant sees the T SNP as a candidate but seems to correctly reject it. I think that's what's going on. In any case, a candidate here for the SNP seems to be a function of the realigner?

The behavior should not have changed from DeepVariant v1.2 to v1.5. In general, uncalled multiallelic events should be a relatively rare phenomenon.

Generally, if you need to, you can prune the alleles that are not called in DeepVariant VCF as long as the result of pruning stays compliant with the VCF spec expected by any downstream tools.

pichuan commented 1 year ago

Closing this issue because the last activity was more than a month ago.