google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.25k stars 728 forks source link

Strange GT value #592

Closed li1ba closed 1 year ago

li1ba commented 2 years ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.4/docs/FAQ.md: Yes

Describe the issue: (A clear and concise description of what the issue is.)

A variant with VAF value 1 is called as heterozygous. The IGV visualisation of the .bam file shows that it should clearly be a homozygous variant.

igv_panel_chr2_24146804

Here the line from .vcf file:

chr2    24146804    .   C   T   29.5    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:3:162:0,162:1:26,0,0

.gvcf file:

chr2    24146804    .   C   T,<*>   29.5    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:3:162:0,162,0:1,0:26,0,0,990,990,990

We also asked our collaborators to run the same sample and in their results, a homozygous variant is called:

chr2    24146804    .   C   T   30.8    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:5:161:0,161:1:28,2,0
chr2    24146804    .   C   T,<*>   30.8    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:5:161:0,161,0:1,0:28,2,0,990,990,990

What could cause this discrepancy, if the DeepVariant versions and commands are the same?

Setup

Steps to reproduce:

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md. Is there any way to reproduce the issue by using the quick start? No.

AndrewCarroll commented 2 years ago

Hi @li1ba

Thank you for the report, and for the VCF lines and IGV screenshot which are very informative. there are a few items to discuss:

1) Difference in genotype between runs

The GT from the two runs indicates that the neural network assesses the probability of 0/1 and 1/1 calls to be very similar (in the first entry they are rounded identically in the PL field). Your collaborator calls do have a small lean toward 1/1. DeepVariant should give identical results when the same version is run on the same hardware platform (e.g. CPU-CPU). However, there can be floating point differences with very minor (almost never at the level of the variant call, but mostly at the GQ level) between compute platforms. Can you confirm that you and your collaborator ran the exact same version of DeepVariant on the same compute platform, or if there might be a difference (e.g. CPU vs Parabricks GPU).

2) Why is the call here 0/1 given the pileup

The IGV screenshot you've attached certainly looks 1/1, and all experts will assess it as a 1/1 from what is shown. We have observed that in rare circumstances, DeepVariant will occasionally call such positions as 0/1 or 0/0 or to decrease the confidence in the calls of certain positions. The signature for this seems to be when DeepVariant assesses a region to be likely to be a segmental duplication or structural variant. Signatures for that often involve a haplotype with dense variants while another haplotype is almost entirely reference, or a high amount of discordant read mapping or low MAPQ. Although your pileup does have a discordantly mapped read present, those patterns generally aren't present.

For some reason, in both your and your collaborator's run, DeepVariant seems to think that this region is difficult to call.

In these cases, I'm always interested to see whether this could highlight a bug in DeepVariant, or if it reflects some learning of the model. Would there be any chance for you to share the small window of the BAM file here (maybe +/- 500 bp from the position). People in the team would be interested in whether this could detect any sort of edge case DeepVariant isn't handling well.

Thanks, Andrew

li1ba commented 2 years ago

Hi Andrew!

Thanks for looking into this!

  1. Our collaborators use Ubuntu18.04, CPU and the same DeepVariant version (1.2.0) with Docker. The results might not be the same because they didn't use the same .bam file, and they have a slightly different mapping command.
  2. I have attached a small window of the BAM file, as you asked. I could also share the whole BAM file per e-mail if you would be interested, as we saw multiple such variants with VAF=1 and GT=0/1.

sample.zip

AndrewCarroll commented 2 years ago

Hi @li1ba

Thank you for the BAM file, I've been able to download it and look at the region in IGV. To my eye, I can't see the reason DeepVariant is calling 0/1 based on the pileup I see. We're going to run a few experiments with this and see if we can identify either something DeepVariant sees that we don't or if this is highlighting some sort of edge case or bug in the code.

AndrewCarroll commented 1 year ago

Hi @li1ba

For the question about why does the model have a difficult time calling this 1/1 versus 0/1, we have done further investigation.

First, it doesn't look like this is a bug in pre-processing or how the data is represented. It seems to be a property of the model.

Second, the property of the model seems to reflect something learned about exome sequencing as opposed to WGS. To determine that, we ran your snippet through both the WGS and WES models. The WGS model is able to confidently call this site as 1/1:

chr2    24146804        .       C       T       34.1    PASS    .       GT:GQ:DP:AD:VAF:PL      1/1:34:162:0,162:1:34,45,0

When we run the WES model, we replicate your finding (this is with the most recent DeepVariant v1.4, so there is a small difference in the GQ values.

chr2    24146804        .       C       T       29.3    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:4:162:0,162:1:26,0,1

This suggests that there is some aspect of exome sequencing that the DeepVariant WES model has learned makes this variant difficult to genotype, possibly because there is some signal that indicates only one allele is present. The reason for this might be some factor which isn't understood (at least by me). This is an interesting observation, but really understanding the reason for a 0/1 call at this position would probably need more investigation (for example, by going into the GIAB training labels for exome sequencing and seeing how often positions that look like this are 0/1 versus 1/1 and trying to understand why).

With respect to why your collaborator has different results from you, it's very likely that the difference in mappers has a small effect on which reads are present and how many map discordantly. This small difference pushes the output for the variant on the edge of probabilities between 0/1 an 1/1 to the other side.

AndrewCarroll commented 1 year ago

Hi @li1ba

I have a follow-up question on this issue. It looks like these reads are 100bp exome reads. We may have identified an issue which could cause genotype calls of this nature in this specific data type. We have a model which we think will address this issue. If you are interested to try running that model, can you email awcarroll@google.com and I can send you the model and instructions on how to run it.

pichuan commented 1 year ago

Hi @li1ba , this problem should be fixed in the latest release: https://github.com/google/deepvariant/releases/tag/v1.5.0

Let us know if it works. I'll close this issue for now.