qlu-lab / PUMAS

Fine-tuning polygenic risk score models using GWAS summary statistics
MIT License

PUMAS/PUMAS-ensemble

PUMAS and PUMAS-ensemble are summary-statistics-based methods to fine-tune, combine, and benchmark PRS methods using only GWAS summary statistics and an LD reference panel. If PRS fine-tuning is the only task, please use the PUMAS functions. To achieve all three objectives, please use PUMAS-ensemble. A workflow of PUMAS/PUMAS-ensemble is shown below.

Announcements

Version History

Getting Started

GWAS summary statistics preparation

We highly recommend that users clean their summary statistics prior to applying PUMAS/PUMACUBS. Here we provide a GWAS sumstats QC script. Please make sure that the input GWAS sumstats have an rsID for each SNP. To use the GWAS QC script, run:

Rscript ./code/gwas_qc.R \
--file_path <raw GWAS sumstats path> \ # required
--frq_path <frequency data path> \ # required
--output_path <output folder> \ # required
--snp <SNP column name> \ # required
--a1 <A1 column name> \ # required
--a2 <A2 column name> \ # required
--stat <BETA/OR column name> \ # required
--OR \ # use this flag if sumstats are reported as odds ratios
--logit \ # use this flag if the sumstats come from logistic regression, regardless of whether OR or beta is reported
--p <P column name> \ # required
--n.total <total sample size> \ # a number, required
--n.col <N/sample size column name> \ # see below for detailed instruction for sample size
--n.case <case sample size> \ # see below for detailed instruction for sample size
--n.con <control sample size> \ # see below for detailed instruction for sample size
--n.case.col <case sample size column name> \ # see below for detailed instruction for sample size
--n.control.col <control sample size column name> \ # see below for detailed instruction for sample size
--chr <CHR column name> \ # optional
--bp <BP column name> \ # optional
--se <SE column number> \ # optional (recommended to provide)
--maf <MAF column number> # optional (recommended to provide)
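
Since the QC script expects an rsID for every SNP, it can help to check the raw file before running it. The following is a minimal sketch and not part of the PUMAS code: the function name `count_non_rsid` is hypothetical, and it assumes a whitespace-delimited sumstats file with a header row and rsIDs of the form `rs` followed by digits.

```shell
# Count data rows whose SNP identifier is not an rsID.
# Assumptions (not from the PUMAS repository): the file is whitespace-delimited,
# has a header row, and rsIDs look like "rs" followed by digits.
count_non_rsid() {
  file=$1   # path to the raw GWAS sumstats
  col=$2    # name of the SNP column (the value you would pass to --snp)
  awk -v col="$col" '
    NR == 1 {
      # Locate the SNP column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
      next
    }
    $idx !~ /^rs[0-9]+$/ { bad++ }
    END { if (idx) print bad + 0 }
  ' "$file"
}
```

For example, `count_non_rsid <raw GWAS sumstats path> SNP` prints the number of rows without an rsID, where `SNP` stands in for whatever the SNP column is called in your file. A result of 0 suggests the file meets the rsID requirement.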

Sample size requirement

Sample size information is often misspecified in reported GWAS summary statistics. Ideally, GWAS sumstats contain the per-SNP total sample size for linear regression association statistics, and the per-SNP case and control sample sizes for logistic regression association statistics. In practice, for linear and logistic summary statistics respectively, users should provide one of the following pieces of sample size information, in the priority order shown below:

If the sumstats don't contain any per-SNP sample size information, this script will impute the sample size and conduct QC based on the imputed values. We follow the sample size imputation approach introduced in Privé et al. (2022).
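
When the sumstats do include a per-SNP sample size column, looking at its range is a quick way to spot misspecified values before running QC. The sketch below is an illustration only and is not part of the QC script: the function name `n_range` is hypothetical, and it assumes a whitespace-delimited file with a header row.

```shell
# Print the minimum and maximum of a per-SNP sample size column.
# Assumptions (not from the PUMAS repository): whitespace-delimited file with
# a header row; the column name is whatever you would pass to --n.col.
n_range() {
  file=$1   # path to the GWAS sumstats
  col=$2    # name of the sample size column
  awk -v col="$col" '
    NR == 1 {
      # Locate the sample size column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
      next
    }
    {
      n = $idx + 0
      if (!seen || n < min) min = n
      if (!seen || n > max) max = n
      seen = 1
    }
    END { if (seen) print "min=" min " max=" max }
  ' "$file"
}
```

A maximum well above the reported total sample size, or a minimum near zero, would suggest that the column deserves a closer look.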

Using PUMAS

Subsample training and tuning summary statistics

For PUMAS/PUMAS-ensemble to subsample GWAS summary statistics from full GWAS summary-level data, two datasets are required:

Evaluate PRS performance

After partitioning summary statistics, users can train any PRS method using the subsampled training summary statistics. Then, to use PUMAS for evaluating and fine-tuning PRS methods, three datasets are required:

After gathering all necessary datasets, run:

Rscript ./code/PUMAS.evaluation.R \
--k <number of folds> \
--ref_path <LD ref> \
--trait_name <trait name> \
--prs_method <prs_method> \
--xty_path <subsampled sumstats folder> \
--stats_path <statistics folder> \
--weight_path <SNP weights> \
--output_path <output folder>

Using PUMAS-ensemble

Subsample training, tuning, ensemble training, and testing summary statistics

PUMAS-ensemble uses exactly the same inputs as PUMAS. The only difference in implementation between PUMAS and PUMAS-ensemble is the script that is run. To partition full GWAS summary statistics into four different subsets, run:

Rscript ./code/PUMAS-ensemble.subsampling.R \
--k <number of folds> \
--partitions <training>,<tuning>,<ensemble training>,<testing> \
--trait_name <trait name> \
--gwas_path <GWAS sumstats folder> \
--ld_path <ld folder> \
--output_path <output folder>
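
The `--partitions` argument takes four comma-separated values, one per subset. The following sketch checks that shape before the subsampling script is called. It is not part of PUMAS-ensemble: the function name `valid_partitions` is hypothetical, and treating the values as non-negative numbers is an assumption, since this README does not state their units or whether they must sum to a particular total.

```shell
# Succeed (exit status 0) only if the argument consists of exactly four
# comma-separated non-negative numbers.
# Assumption (not from the PUMAS repository): partition values are plain
# numbers such as 0.6 or 10.
valid_partitions() {
  printf '%s\n' "$1" | awk -F',' '
    NF != 4 { bad = 1 }
    { for (i = 1; i <= NF; i++) if ($i !~ /^[0-9]*\.?[0-9]+$/) bad = 1 }
    END { exit bad + 0 }
  '
}
```

For example, `valid_partitions "0.6,0.2,0.1,0.1" || echo "check --partitions"` stays silent, while a string with three fields prints the warning. The values used here are illustrative only.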

Construct ensemble PRS and benchmark PRS models

The required input datasets are mostly the same as those for PUMAS's PRS evaluation function. Unlike PUMAS, PUMAS-ensemble requires the SNP weights from each PRS method to be stored in a separate .<prs_method>.txt file, so that it can construct the ensemble PRS from each method's fine-tuned PRS model and benchmark all PRS models. Again, please make sure that the SNP weights files have exactly the same set of SNPs, A1, and A2, in the same order, as the subsampled summary statistics. After PRS model training, run:

Rscript ./code/PUMAS-ensemble.evaluation.R \
--k <number of folds> \
--ref_path <LD ref> \
--trait_name <trait name> \
--prs_method <prs_methods> \
--xty_path <subsampled sumstats folder> \
--stats_path <statistics folder> \
--weight_path <SNP weights> \
--output_path <output folder>
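
The requirement that each SNP weights file lists the same SNPs, A1, and A2 in the same order as the subsampled summary statistics can be checked before evaluation. The sketch below is an illustration and not part of PUMAS-ensemble: the function name `check_alignment` is hypothetical, and it assumes both files are whitespace-delimited, have one header row, and store SNP, A1, and A2 in their first three columns.

```shell
# Compare SNP, A1, A2 row by row between a sumstats file and a weights file.
# Assumptions (not from the PUMAS repository): both files are
# whitespace-delimited with one header row, and SNP, A1, A2 are the first
# three columns. Header rows are skipped and not compared.
check_alignment() {
  sumstats=$1   # subsampled summary statistics file
  weights=$2    # SNP weights file from one PRS method
  awk '
    FNR == 1 { next }
    NR == FNR { key[FNR] = $1 " " $2 " " $3; n = FNR; next }
    {
      m = FNR
      if (FNR > n || key[FNR] != $1 " " $2 " " $3) {
        print "mismatch at line " FNR
        bad = 1
        exit 1
      }
    }
    END { if (!bad && m != n) { print "row counts differ"; exit 1 } }
  ' "$sumstats" "$weights"
}
```

The function returns a non-zero status and prints the first offending line number when the files disagree, so it can gate the evaluation step, for example `check_alignment <sumstats file> <weights file> && Rscript ./code/PUMAS-ensemble.evaluation.R ...` with the remaining arguments filled in as above. It holds the first file's keys in memory, which is a simplification chosen for brevity.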

Output

PUMAS

Subsampling

PRS Fine-tuning

PUMAS-ensemble

Subsampling

Ensemble PRS construction and PRS benchmarking

Citation

Support

Please send questions and issues related to PUMAS/PUMAS-ensemble software to Zijie Zhao (zzhao232@wisc.edu) and Qiongshi Lu (qlu@biostat.wisc.edu).