google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.21k stars 722 forks source link

Deepvariant and GATK – GT inconsistency. #135

Closed HagenC closed 5 years ago

HagenC commented 5 years ago

Hi,

First of all I would like to thank you for your hard work – I really do appreciate it.

I have called variants using deepvariant v0.7.1 and GATK (4.0.8.0) using the same WGS bam.file and compared the results. Having only included SNP sites with DP => 10 there are 4.9% mismatches i.e where the GT is not the same. I would like to compare the measures AD,DP,GQ etc. of deepvariant and GATK in order to understand the discrepancies, but have not succeeded in finding any detailed documentation on how these are derived by deepvariant. GATKs HaplotypeCaller is realigning reads for each active region, including only informative reads and using pairHMM to calculate the likelihood of haplotypes before assigning the GT using Bayes rule. This seem a very good strategy on low coverage sites (<10), but can assign unlikely GTs if considering the ref/alt ratios in sites above 20x where a frequentist approach probably would be more appropriate. I can see that deepvariant do not exclude low quality reads/mappings in the AD and is generally higher in deepvariant compared to GATK AD - i.e all reads/bases are used in the call no matter the quality. Is this correct? The GQ assigned is in GATK just the second smallest PL – how do deepvariant calculate PL and GQ? Can you recommend a GQ cutoff for hardfiltering? It seems like the GT is assigned based on a combination of PL and VAF – is there a general equation for this?

If there exist a more detailed description of the algorithms and arguments used by deepvariant I would really appreciate it you could post a link.

Thanks and Happy New Year.

Cheers, C

AndrewCarroll commented 5 years ago

Hello C,

Thank you for taking the time to investigate DeepVariant.

With respect to DP and AD, reads considered have some minimum thresholds for inclusion. You can see these in deepvariant/make_examples.py lines 274-277. The default is a filter at MAPQ 10. You can see this manifest in the calculation of allele depth in deepvariant/allelecounter.cc line 289.

Above these thresholds, DeepVariant will count all of the reads present. However, there is an additional complexity. DeepVariant generates local a local of the ref and alt alleles by constructing a De Bruijn graph of each. It then performs a Smith-Waterman re-alignment of reads to the assemblies. You can think of this as a more exhaustive version of the candidate haplotype generation performed in GATK. As a result of this re-assembly, there may be some differences between the alleles that DeepVariant constructs and those constructed in GATK (and this may contribute to differences in AD). See the deepvariant/realigner folder for all of the associated code.

With respect to GT and GQ, these are the primary outputs of the convolutional neural network classifier. The classifier estimates the probability of HOM REF, HET, and HOM ALT states. The GT is the most probably state as determined by the classifier. The GQ should correspond to the likelihood calculated for that GT (and as a result, this should correspond to PL).

With respect to the calibration of GQ and recommendations for filtering. One observation we have about DeepVariant is that the genotype qualities seem to quite accurately reflect the empirical error probability (see Figure 2 Panel C of - https://www.nature.com/articles/nbt.4235). This fact, combined with the observation that the GQ scores produced by DeepVariant are quite normally distributed, means that you have flexibility to shift them slightly higher or lower if you prefer higher precision or higher recall.

If anything, DeepVariant seems to be slightly on the conservative side outside of the confident regions, so I would likely recommend you not perform additional hard filtering on GQ, and consider REF calls with GQ below 10 or 20 to be more like no-calls.

If you have not yet read the Nature Biotechnology manuscript, I would recommend that as another good overview https://www.nature.com/articles/nbt.423

If you are curious to compare calls between DeepVariant and other technologies, I would also recommend that you use metrics like TiTv ratio or dbSNP fraction on the calls that are shared and singletons between the two methods.

Thanks, Andrew

HagenC commented 5 years ago

Thanks for the reply.

I have analyzed some mismatches and here are some thoughts:

Looking at two mismatches, here is the first one:

Chr1:261478C>A :

GAKT have a minimum BaseQuality of 10 and MQ of 20, deepvariant 10 and 10.

Calculating the probability of AD distribution 31,12 (GATK) and 33,13 (deepvariant) is approximately 0.24% for deepvariant and 0,26% for GATK – so the probability that this distribution occurs assuming diploid and 50% chance of success the GT is most likely 0/0. GATKs GT is 0/1 with a GQ of 99 – it is however flagged a VQSRTrancheSNP99.90to100.00, but this is dependent on the other genomes in the sample( Im not a big fan of VQSR). Manually inspecting the variant using IGV I would annotate this as 0/1, but it is reasonable that it is assigned 0/0 considering the errors introduced by Illumnia sequencing are not random, even though with an error rate of 1:1000 it is highly unlikely that the same error occurs at the same site 13 times. (btw this is PCR-free WGS)

The variant chr1:202192T>C:

The probability is here for deepvariant 1.75% and GATK 22.6% - so the discrepancy of this variant makes sense. Manually inspecting the variant using IGV I would annotate this as 0/1. I guess that by including low quality MQ in this case decreases the GQ for deepvariant and increasing the min_mapping_quality to 20 could be worth trying for comparison. Is there an argument that can be used for this? (im using GCP for this project)

I have noticed that deepvariant call variants in soft-clipped regions, but is seems they are assigned 0/0 – this is great – is this an intentional feature in deepvariant algorithm? This could be one reason to the discrepancies as GATK by default also call in soft-clipped regions. I will try to excluded these regions from GATK HC and see what the impact is. Is there an argument to use in deepvariant to control this feature?

At the moment there is a bug in GATK (https://github.com/broadinstitute/gatk/issues/3697) which also could cause some discrepancies – I have noticed several calls where the alternative bases are completely ignored and I may be caused by this bug.

Cheers, C

akolesnikov commented 5 years ago

Hello C,

I cannot comment much on your examples without seeing IGV. Keep in mind that DeepVariant does a realignment, therefore some reads may align differently from what you see in original BAM.

As for the second example min_mapping_quality is set to 20 by default. It is possible to change this parameter in make_examples (make_examples is one of the steps and is called by GCP runner)

Regarding soft-clipped regions. As Andrew mentioned earlier, reads are realigned to haplotypes, therefore sometimes new alignment can be done without soft-clips (read can be aligned to haplotype perfectly, although when aligning the same read to reference we get a soft-clip) There is no way to control this behavior.

Thanks Alex

HagenC commented 5 years ago

Thanks for the reply.

Is it possible to get the realignments out as a bam file?

Thanks, Christian

akolesnikov commented 5 years ago

Unfortunately it is not possible as of today. You may add couple of lines of code to realigner.py to output realigned reads with cigars to console or text file and run make_example on a small window of interest. Also you may run DeepVariant without realigner using --norealign_reads flag.

HagenC commented 5 years ago

OK, thanks. This is probably an embarrassing question: I´m using GCP cloud shell and I am new to this, so how can I manipulate files in the docker images i.e changing the parameters in make_example.py? Is there a list of arguments/flags somewhere for make_example etc?.

For example from make_examples.py line 274-277: read_reqs = reads_pb2.ReadRequirements( min_base_quality=10, min_mapping_quality=10, min_base_quality_mode=reads_pb2.ReadRequirements.ENFORCED_BY_CLIENT)

Could on change the parameters to --make_examples_min_base_quality = 20?

Or maybe copy the image to a local bucket and manipulate them there? Sorry for taking so much of your time!

Cheers, Christian

gunjanbaid commented 5 years ago

@HagenC all flags for make_example.py can be found here (lines 87-207). For the min_base_quality that you reference, there is currently no flag set up to modify that value. If you want to modify that value, the best way to do so would be to modify the default in the source code or add a flag and then rebuild DeepVariant. I realize that this might be a bit tedious to do, and we will consider including a flag for that parameter later on.

gunjanbaid commented 5 years ago

@HagenC I want to clarify a bit more. Even though you cannot currently modify the value of min_base_quality, you CAN modify the existing flags, depending on how you are running DeepVariant. I was not sure what you meant by the GCP cloud shell. Are you running on one machine yourself or following this tutorial?

If you are running on one machine, you can take a look at the WGS case study. This is an older version of the document that shows how to pass different flags to each step (make_examples, call_variants, postprocess_variants). You can find the current version of the document here, but this uses Docker and does not show how to run each step with additional flags.

HagenC commented 5 years ago

@gunjanbaid I am following the tutorial https://cloud.google.com/genomics/docs/tutorials/deepvariant - so GCP WM with Debian 9.6 (4.14.74+ GNU/Linux) running the docker container that you have provided. I have access to a CentOS 7.5.1804 (3.10.0-862.3.3.el7.x86_64 GNU/Linux) system, however, I do not have full admin rights as it a shared system and there are several post about issues with CentOS 7 and not having root permissions. Unfortunately I don´t have the time to play around with this at the moment – I’m expecting there will be a lot of trouble shooting and work arounds!. Thanks for all your help.

gunjanbaid commented 5 years ago

@HagenC no problem! Let me know if there is anything else I can help with. Otherwise, I will close this issue.

gunjanbaid commented 5 years ago

@HagenC We have internally added both min_base_quality and min_mapping_quality as flags in make_examples and these will come out in the next DeepVariant release.