pjgreer / ukb-rap-tools

Scripts and workflows for analyzing UK Biobank data from the DNAnexus Research Analysis Platform

How do we apply this process #1

Closed TrumanZYX closed 1 year ago

TrumanZYX commented 1 year ago

Thanks for sharing. Which dataset does your code use? Thanks

pjgreer commented 1 year ago

These scripts use a combination of data sources, including V2 (genotype, grch37), V3 (imputed, grch37), and Whole Exome (grch38).

The V2 (grch37, non-imputed) genotype data is used only by REGENIE to build an initial model that captures ancestry, relationship, and some phenotype information. The second step of the REGENIE process uses the V3 (grch37, imputed) data OR the Whole Exome data (grch38) for the regression model, conditional on the step 1 model. The whole point of the step 1 model is to model out confounding variables like ancestry and genetic relationships. You therefore want to use a smaller, genome-wide dataset with ~400K markers. The WES data is inappropriate for step 1 because it is not genome-wide, and the V3 dataset is overkill (>30 million markers) for the initial REGENIE model.

Please see the full regenie documentation here: https://rgcgithub.github.io/regenie/overview/
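
As a rough sketch of the two-step process described above, the Python helpers below build the argument lists for a REGENIE step 1 run on the genotyped (V2) data and a step 2 run on the imputed (V3) data. The flags are taken from the REGENIE documentation; every file name and prefix in the usage example is a hypothetical placeholder and not a path used by the scripts in this repository. Trait-type options such as `--bt` are left out for brevity.

```python
def regenie_step1_cmd(bed_prefix, pheno_file, covar_file, out_prefix, bsize=1000):
    """Argument list for REGENIE step 1: fit the whole-genome model
    on the smaller genotyped marker set (PLINK bed/bim/fam prefix)."""
    return [
        "regenie", "--step", "1",
        "--bed", bed_prefix,
        "--phenoFile", pheno_file,
        "--covarFile", covar_file,
        "--bsize", str(bsize),
        "--out", out_prefix,
    ]


def regenie_step2_cmd(bgen_file, pheno_file, covar_file,
                      step1_out_prefix, out_prefix, bsize=400):
    """Argument list for REGENIE step 2: association testing on the
    imputed data, conditional on the step 1 predictions."""
    return [
        "regenie", "--step", "2",
        "--bgen", bgen_file,
        "--phenoFile", pheno_file,
        "--covarFile", covar_file,
        # Step 1 writes a list of prediction files named <out>_pred.list.
        "--pred", step1_out_prefix + "_pred.list",
        "--bsize", str(bsize),
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    step1 = regenie_step1_cmd("v2_genotypes_qc", "pheno.txt", "covar.txt", "step1_fit")
    step2 = regenie_step2_cmd("v3_imputed_chr1.bgen", "pheno.txt", "covar.txt",
                              "step1_fit", "gwas_chr1")
    print(" ".join(step1))
    print(" ".join(step2))
```

The point of the sketch is the link between the steps: step 2 reads the `_pred.list` file written by step 1, which is why the step 1 model has to be built first.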

If you are analyzing the V3 (imputed) dataset, you can perform GWAS using either REGENIE or PLINK2. The simplest analysis is to use PLINK2 and just run the scripts in gwas_impute37_plink. This will only use the V3 data. If you choose to use REGENIE, you must first run the scripts in GTfile_prep using the V2 data, then run the scripts in gwas_impute37_regenie on the V3 data.

If you want to run GWAS on the WES data, the simplest way is to just run the PLINK workflow here: gwas_wes38_plink. Using REGENIE requires the most steps: 1. GTfile_prep, 2. GTfile_liftover, and finally 3. gwas_wes38_regenie. You must lift over the V2 data from grch37 to grch38 so that the model blocks can match between the V2 data and the WES data.
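
The four routes described above can be summarized as a small lookup, sketched here in Python. The directory names come from the comment above; the `WORKFLOWS` table, the `workflow_steps` helper, and the `"imputed"`/`"wes"` labels are illustrative names and not part of the repository.

```python
# Ordered script directories to run for each (dataset, tool) combination.
WORKFLOWS = {
    ("imputed", "plink"): ["gwas_impute37_plink"],
    ("imputed", "regenie"): ["GTfile_prep", "gwas_impute37_regenie"],
    ("wes", "plink"): ["gwas_wes38_plink"],
    ("wes", "regenie"): ["GTfile_prep", "GTfile_liftover", "gwas_wes38_regenie"],
}


def workflow_steps(dataset, tool):
    """Return the script directories to run, in order."""
    key = (dataset.lower(), tool.lower())
    if key not in WORKFLOWS:
        raise ValueError(f"unknown dataset/tool combination: {dataset!r}, {tool!r}")
    # Return a copy so callers cannot modify the table.
    return list(WORKFLOWS[key])


if __name__ == "__main__":
    for step in workflow_steps("wes", "regenie"):
        print(step)
```

Both REGENIE routes start with GTfile_prep because step 1 always uses the V2 genotype data; only the WES route adds the liftover to grch38.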

I hope this helps.