mrcepid-rap / mrcepid-filterbcf

Filter VCF/BCF according to MRCEpid parameters
MIT License

Consequences of not applying any filter on the alt/alt homozygotes #2

Open dianacornejo opened 1 year ago

dianacornejo commented 1 year ago

Hi @eugenegardner, just wondering what the consequences would be of not applying any filter to the ALT/ALT genotypes. Would it very likely mean including false-positive calls? Also, do you have any idea why GQ differs depending on the genotype, and has this been discussed by the UKB people? Just wondering what the best approach is.

Thank you

Diana

eugenegardner commented 1 year ago

I think there are two primary consequences:

  1. Sites with high levels of erroneous genotypes would be retained, which is why I soft-filter on 50% missingness in this pipeline after hard-filtering individual genotypes. The distribution is predominantly binary (i.e., most sites are either really bad or really good).
  2. Individual genotypes in otherwise high-quality sites would be retained, leading to a relatively small number of false-positive genotypes for individual sites.
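
To make the two steps concrete, here is a minimal sketch of hard-filtering genotypes and then soft-filtering the site on missingness. It is not the applet's actual code: the function names, the dict layout for a call, the GQ cutoff of 20, and reading "50% missingness" as "strictly more than 50% missing" are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

# A call is a dict such as {"gt": (0, 1), "gq": 30}; None marks a missing genotype.
# This layout is hypothetical and chosen only to keep the sketch self-contained.
Call = Optional[dict]


def hard_filter(calls: List[Call], keep: Callable[[dict], bool]) -> List[Call]:
    """Set every genotype that fails `keep` to missing (None)."""
    return [c if c is not None and keep(c) else None for c in calls]


def missingness(calls: List[Call]) -> float:
    """Fraction of missing genotypes at a site (1.0 for an empty site)."""
    if not calls:
        return 1.0
    return sum(c is None for c in calls) / len(calls)


def site_label(calls: List[Call], max_missing: float = 0.5) -> str:
    """Soft filter: label the site rather than dropping it.

    Assumption: a site fails only when missingness is strictly above the cutoff.
    """
    return "FAIL" if missingness(calls) > max_missing else "PASS"


def example_keep(call: dict) -> bool:
    """Hypothetical rule: hom-alt calls pass unfiltered, others need GQ >= 20."""
    if call["gt"] == (1, 1):
        return True
    return call["gq"] >= 20  # placeholder threshold, not taken from the applet


site = [
    {"gt": (0, 1), "gq": 30},
    {"gt": (0, 1), "gq": 5},
    {"gt": (1, 1), "gq": 5},   # low GQ, but retained because it is hom-alt
    {"gt": (0, 0), "gq": 10},
]
filtered = hard_filter(site, example_keep)
print(missingness(filtered), site_label(filtered))
```

The low-GQ hom-alt call survives here, which is consequence 2; the site-level label is what catches sites where most genotypes are bad, which is consequence 1.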

As for the GQ issue with homozygous alt calls – I have no idea! I assume it is something to do with how the GQ field was calibrated when running DeepVariant, but I am unsure. I posted the issue originally on the old UKBB forums where it was confirmed, but as far as I am aware, nothing was ever done about it. In general, homozygous alt calls are of high quality due to how the genotyper works – the majority of issues are for heterozygous genotypes.

In general I think you can follow the filtering approaches outlined in this applet and have relatively high-quality data.

dianacornejo commented 1 year ago

@eugenegardner thanks for the response! I'm filtering using an approach very similar to yours. Also, a follow-up question: do you know whether the AAscore is only available for the WGS data, or is there something similar for the WES data? The authors mention it in this paper.

eugenegardner commented 1 year ago

AAscore is just a variant quality value similar to something like VQSR from GATK. The team that generated the WES data used Google DeepVariant according to their best practices. I am unfamiliar with exactly how filtering is done by DeepVariant (and what quality scores it does / does not generate), but at any rate the only score included in the pVCFs released to users of UKBiobank is AQ. I am not familiar with how AQ is calculated.