google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.24k stars 727 forks source link

Questions about how GQ does work #326

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello! so some quick background, I am interested in SNP that are unique (= private) to each sample of my cohort. I ran DeepVariant on each sample, then GLnexus as per the best practices recommendations.

I extracted unique variants using bcftools --private option. Then, I wanted to do some filtering on GQ.

Here is the GQ distribution on one of the individual vcf file (so before joint-calling), I don't have much to say about it, it makes sense GQ

However, here is the GQ distribution of the --private SNPs GQ_private

First, there are some values that are NA due to SNPs that have . as GQ value. That's all right, errors in sequencing / mapping I guess. In fact, my reasoning is that the --private option will enrich the SNP set in all the errors that are unique to each sequencing data set. Therefore, I was expecting that the GQ distribution would be shifted to the left. However, what we see is that, not only is it shifted to the left but the shape of the distribution is also changed. So I would like to know more about how GQ is exactly computed by DeepVariant. And why does the GQ seems to abruptly peak at 11-12

I also read on your blog that you consider "high quality variants" as the ones with a GQ of 20. Of course, owing to the distribution of GQ for the private set, setting a GQ threshold at 20 will make a big difference, as seen on this plot GQ_range_subs_cropped_for_github

Thanks a lot for your insight!

ghost commented 4 years ago

To give you more perspective, here is the effect of filtering, with filters >= GQ (each dot represents one of my 10 samples) allGQ_private_crop_github

tedyun commented 4 years ago

Hi @aderzelle, thank you for your question. I believe this is coming from the cohort merging step (by GLnexus) assuming that you used --config DeepVariantWGS or --config DeepVariant for merging, not the DeepVariant itself. The merging step by GLnexus includes filtering alleles and revising genotypes and we have conducted an extensive study [1] on finding the best merging parameters for DeepVariant outputs, which was later pushed to the open-source GLnexus.

The current version of the merging parameters uses min_AQ1 = min_AQ2 = 10 (AQ means "allele quality") as you can see here https://github.com/dnanexus-rnd/GLnexus/blob/4d057dcf24b68b33de7a9759ae65ca2b144a3d47/src/cli_utils.cc#L874 The definitions of these parameters can be found at https://github.com/dnanexus-rnd/GLnexus/wiki/Configuration .

I believe this is the reason why you are seeing the sharp drop at GQ = 10 as the alleles corresponding to them were already filtered. If you'd like to merge DeepVariant gVCFs without any filtering or genotype revision, you can download the .yml file here https://gist.github.com/tedyun/1d4f57ca67fb18647b7b251f9e0b35c2 and use --config DeepVariant_nomod.yml instead when running GLnexus.

I hope this helps and please let us know if you have any more questions/comments.

[1] https://doi.org/10.1101/2020.02.10.942086

Best, Ted

ghost commented 4 years ago

Thanks for the very helpful answer. So, actually what I am seeing is the distribution for "private" seems to be centred on 10 and has been amputated of its left part by GLnexus. Interesting ...

From GLnexus definition

Thus we may have a lower quality threshold for alleles observed in multiple individuals, compared to singletons.

That means we should be more stringent on quality filtering for singletons since they are not supported by observations in more than one individual, right?

tedyun commented 4 years ago

We can be (that would imply min_AQ1 > min_AQ2), but based on our investigation the current settings were chosen to give the best overall quality for variety of cohort sizes and sequencing depths.

Depending on how the cohort VCF is used, one may want to apply additional filters to the cohort, which can either be applied after merging using a standard VCF modification tool (e.g. bcftools), or by changing the merging settings directly in the .yml file.

tedyun commented 4 years ago

FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). https://github.com/dnanexus-rnd/GLnexus/pull/229 If this is accepted, you'll be able to try it out without downloading an external .yml file.

@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)

ghost commented 4 years ago

FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). dnanexus-rnd/GLnexus#229 If this is accepted, you'll be able to try it out without downloading an external .yml file.

@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)

Thanks for your detail explanation! I think I have all I need ;)