pjgreer / ukb-rap-tools

Scripts and workflows for analyzing UK Biobank data on the DNAnexus Research Analysis Platform

About 03-GTprep-ldprune.sh and whether to do QC for regenie step 2 #16

Closed: iamyingzhou closed this issue 11 months ago

iamyingzhou commented 1 year ago

Dear pjgreer, I am working on WES GWAS and rare variant analysis. I've been comparing your scripts with the ones from dnanexus/UKB_RAP and noticed some differences. I'm uncertain whether the '03-GTprep-ldprune.sh' script is a necessary part of the workflow. I'm also wondering whether to apply quality control in step 2 of the regenie analysis.

Thanks!

pjgreer commented 1 year ago

I implemented the ldprune step to reduce the number of SNPs prior to liftover. Liftover is a single-threaded application, and cutting the number of input SNPs from 800K+ to 400K+ roughly halves the compute time and price. Liftover must be done for step 1 when running step 2 on WES or on the TOPMed imputed data sets. The UKB_RAP repo skips this step and simply defers to a separate WDL tool for liftover; I wanted a workflow that covers all the steps via dx run.
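For illustration, a minimal sketch of what an LD-pruning pass before liftover can look like (assuming plink2 and a merged genotype fileset named ukb_gt; the thresholds and file names are placeholders, not the exact ones used in 03-GTprep-ldprune.sh):

```bash
# Sketch only: placeholder thresholds and file names, not the repo's actual script.
# Pass 1: build an LD-pruned SNP list (window of 500 variants, step 50, prune at r^2 > 0.5).
plink2 --bfile ukb_gt \
       --maf 0.01 --geno 0.1 \
       --indep-pairwise 500 50 0.5 \
       --out ukb_gt_prune

# Pass 2: keep only the pruned-in SNPs, shrinking the input handed to liftover.
plink2 --bfile ukb_gt \
       --extract ukb_gt_prune.prune.in \
       --make-bed \
       --out ukb_gt_pruned
```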

If you are running rare variant analysis and not GWAS, then I agree that the QC filter for the WES data is unnecessary. Most of the variants in the WES data are very rare (MAF < 0.001), and these rare variants tend to fail GWAS association tests.

iamyingzhou commented 1 year ago

Dear pjgreer,

Thank you for your insightful response. I have indeed observed the difference you mention, as I've been using your scripts for the liftover step. Unlike your approach, I did not run ldprune, and as a result the job ran for over two and a half days before it was halted after hitting the 1 TB disk limit. I also noticed that CPU usage on the instance was low, around 16%. I'm considering not merging all chromosomes up front and instead running the liftover for each chromosome in parallel before combining them, which I believe could substantially reduce the runtime.
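In case it helps anyone following along, this is roughly what I have in mind (a rough sketch assuming UCSC liftOver, the hg19ToHg38 chain file, plink 1.9, and per-chromosome filesets named ukb_gt_chr${c}; all names are placeholders, not files from this repo):

```bash
#!/bin/bash
# Rough sketch only: lift each chromosome GRCh37 -> GRCh38 in parallel, then merge.
# Real runs would need extra handling for variants that map to alt contigs
# or to a different chromosome.
for c in {1..22}; do
  (
    # Turn the .bim into a UCSC BED interval file (0-based start, 1-based end).
    awk '{print "chr"$1"\t"$4-1"\t"$4"\t"$2}' ukb_gt_chr${c}.bim > chr${c}.b37.bed
    liftOver chr${c}.b37.bed hg19ToHg38.over.chain.gz chr${c}.b38.bed chr${c}.unmapped
    # Keep only the variants that lifted, and update their positions.
    cut -f4 chr${c}.b38.bed > chr${c}.keep.txt
    awk '{print $4"\t"$3}' chr${c}.b38.bed > chr${c}.newpos.txt
    plink --bfile ukb_gt_chr${c} \
          --extract chr${c}.keep.txt \
          --update-map chr${c}.newpos.txt \
          --make-bed --out ukb_gt_chr${c}_b38
  ) &
done
wait

# Merge the lifted per-chromosome filesets back together.
for c in {2..22}; do echo ukb_gt_chr${c}_b38; done > merge_list.txt
plink --bfile ukb_gt_chr1_b38 --merge-list merge_list.txt --make-bed --out ukb_gt_b38
```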

Additionally, I would like to ask: if I run a GWAS and do not perform QC on the imputed genotype files, what are the potential consequences? Would SNPs with extremely small MAF end up in the results? Since my genotype files have already been quality controlled, would that prevent this? My question comes from the regenie documentation for UK Biobank GWAS analysis (https://rgcgithub.github.io/regenie/recommendations/), where the example code does not include QC steps for the imputed files.
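For reference, this is the kind of step 2 call I have in mind, with filters applied at analysis time rather than by pre-filtering the files (a stripped-down placeholder, not code from your repo or the regenie docs; I am assuming regenie's --minMAC and --minINFO options here):

```bash
# Placeholder sketch: regenie step 2 on imputed BGEN data with explicit
# MAC and imputation-info filters instead of pre-filtered input files.
regenie \
  --step 2 \
  --bgen ukb_imp_chr1.bgen \
  --sample ukb_imp_chr1.sample \
  --phenoFile pheno.txt \
  --covarFile covar.txt \
  --bt --firth --approx \
  --pred step1_pred.list \
  --bsize 400 \
  --minMAC 20 \
  --minINFO 0.8 \
  --out gwas_chr1
```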

Best regards

pjgreer commented 1 year ago

Honestly, I have not tested those options in regenie step 2. I do know that you should not apply allele frequency based filtering if you are running rare variant testing in step 2. I use plink for most of my analyses, after finding that SAIGE, plink, and regenie all produce comparable GWAS results. For plink, you should filter out missing and very rare variants; perhaps regenie does the same thing behind the curtain, I really do not know. I know it doesn't really hurt for GWAS, but it should not be done for rare variant analysis.
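As a rough illustration (thresholds are just examples, not a recommendation), the kind of plink2 run I mean, with missingness and frequency filters applied at analysis time, looks something like this:

```bash
# Illustrative only: plink2 GWAS on imputed BGEN data with missingness
# and frequency filters; file names, thresholds, and the phenotype
# column name 'outcome' are placeholders.
plink2 --bgen ukb_imp_chr1.bgen ref-first \
       --sample ukb_imp_chr1.sample \
       --pheno pheno.txt --pheno-name outcome \
       --covar covar.txt \
       --geno 0.1 \
       --maf 0.01 \
       --glm hide-covar \
       --out gwas_chr1_plink
```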

iamyingzhou commented 1 year ago

Got it, much appreciated!