Closed ghost closed 4 years ago
To give you more perspective, here is the effect of filtering, with filters >= GQ (each dot represents one of my 10 samples)
Hi @aderzelle, thank you for your question. I believe this is coming from the cohort merging step (by GLnexus) assuming that you used --config DeepVariantWGS
or --config DeepVariant
for merging, not the DeepVariant itself. The merging step by GLnexus includes filtering alleles and revising genotypes and we have conducted an extensive study [1] on finding the best merging parameters for DeepVariant outputs, which was later pushed to the open-source GLnexus.
The current version of the merging parameters uses min_AQ1 = min_AQ2 = 10
(AQ means "allele quality") as you can see here https://github.com/dnanexus-rnd/GLnexus/blob/4d057dcf24b68b33de7a9759ae65ca2b144a3d47/src/cli_utils.cc#L874 The definitions of these parameters can be found at https://github.com/dnanexus-rnd/GLnexus/wiki/Configuration .
I believe this is the reason why you are seeing the sharp drop at GQ = 10
as the alleles corresponding to them were already filtered. If you'd like to merge DeepVariant gVCFs without any filtering or genotype revision, you can download the .yml
file here https://gist.github.com/tedyun/1d4f57ca67fb18647b7b251f9e0b35c2 and use --config DeepVariant_nomod.yml
instead when running GLnexus.
I hope this helps and please let us know if you have any more questions/comments.
[1] https://doi.org/10.1101/2020.02.10.942086
Best, Ted
Thanks for the very helpful answer. So, actually what I am seeing is the distribution for "private" seems to be centred on 10 and has been amputated of its left part by GLnexus. Interesting ...
From GLnexus definition
Thus we may have a lower quality threshold for alleles observed in multiple individuals, compared to singletons.
That means we should be more stringent on quality filtering for singletons since they are not supported by observations in more than one individual, right?
We can be (that would imply min_AQ1 > min_AQ2
), but based on our investigation the current settings were chosen to give the best overall quality for variety of cohort sizes and sequencing depths.
Depending on how the cohort VCF is used, one may want to apply additional filters to the cohort, which can either be applied after merging using a standard VCF modification tool (e.g. bcftools), or by changing the merging settings directly in the .yml file.
FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). https://github.com/dnanexus-rnd/GLnexus/pull/229 If this is accepted, you'll be able to try it out without downloading an external .yml file.
@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)
FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). dnanexus-rnd/GLnexus#229 If this is accepted, you'll be able to try it out without downloading an external .yml file.
@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)
Thanks for your detail explanation! I think I have all I need ;)
Hello! so some quick background, I am interested in SNP that are unique (= private) to each sample of my cohort. I ran DeepVariant on each sample, then GLnexus as per the best practices recommendations.
I extracted unique variants using bcftools --private option. Then, I wanted to do some filtering on GQ.
Here is the GQ distribution on one of the individual vcf file (so before joint-calling), I don't have much to say about it, it makes sense
However, here is the GQ distribution of the --private SNPs
First, there are some values that are NA due to SNPs that have . as GQ value. That's all right, errors in sequencing / mapping I guess. In fact, my reasoning is that the --private option will enrich the SNP set in all the errors that are unique to each sequencing data set. Therefore, I was expecting that the GQ distribution would be shifted to the left. However, what we see is that, not only is it shifted to the left but the shape of the distribution is also changed. So I would like to know more about how GQ is exactly computed by DeepVariant. And why does the GQ seems to abruptly peak at 11-12
I also read on your blog that you consider "high quality variants" as the ones with a GQ of 20. Of course, owing to the distribution of GQ for the private set, setting a GQ threshold at 20 will make a big difference, as seen on this plot
Thanks a lot for your insight!