PRS: one data for "discovery", and another data for "derivation"

Hi, guys:

I still have some "design" questions about PRS. We start with a summary statistics file from a GWAS, called the "discovery" data. Then we use some method (such as P+T or beta shrinkage) to pick a set of SNPs, together with their "original" or "shrunk" effect sizes, and use them to score individuals in a DIFFERENT set of samples. I highlight the word "DIFFERENT" because we usually don't run a height GWAS on a data set such as UKB and then construct a height PRS for the same individuals in UKB. Otherwise, the association between the PRS and height would be inflated by overfitting.
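
To make the workflow above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the SNP IDs, effect sizes, P-values, dosages, and the helper names `select_snps` and `score_individuals`. It only performs the thresholding half of P+T (no LD clumping), and it is not how bigsnpr implements PRS.

```python
# Hypothetical "discovery" summary statistics:
# SNP id -> (effect allele, effect size beta, P-value). All values are made up.
discovery_sumstats = {
    "rs1": ("A", 0.12, 1e-9),
    "rs2": ("G", -0.08, 3e-6),
    "rs3": ("T", 0.05, 0.20),
    "rs4": ("C", 0.30, 4e-12),
}

# Hypothetical genotypes for a DIFFERENT set of individuals:
# effect-allele dosage (0, 1 or 2) per SNP.
target_genotypes = {
    "ind1": {"rs1": 2, "rs2": 0, "rs3": 1, "rs4": 1},
    "ind2": {"rs1": 0, "rs2": 2, "rs3": 2, "rs4": 0},
}


def select_snps(sumstats, p_threshold):
    """Keep SNPs with P-value <= p_threshold and return {snp: beta}.

    This is only the thresholding step of P+T; LD clumping is omitted.
    """
    return {
        snp: beta
        for snp, (_allele, beta, p_value) in sumstats.items()
        if p_value <= p_threshold
    }


def score_individuals(genotypes, weights):
    """PRS per individual = sum over selected SNPs of beta * dosage.

    A missing dosage is treated as 0 here, which is a simplification.
    """
    return {
        ind: sum(beta * dosages.get(snp, 0) for snp, beta in weights.items())
        for ind, dosages in genotypes.items()
    }


# Weights come from the discovery data, scores from the target data.
weights = select_snps(discovery_sumstats, p_threshold=5e-8)
prs = score_individuals(target_genotypes, weights)

for ind, score in prs.items():
    print(ind, round(score, 4))
```

The point of the sketch is the separation of inputs: `weights` is built only from the discovery summary statistics, while `prs` is computed only from the genotypes of the other sample.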

However, if I want to see how a height PRS is associated with a DIFFERENT phenotype such as CVD risk, is it OK to run the GWAS and calculate the PRS on the same UKB cohort? My understanding is that either the phenotype or the individuals must be DIFFERENT, but I don't know whether I got this right. Imagine that one day we have a single biobank covering all ~7 billion people on earth, so there is only one dataset. I guess there would be no need to arbitrarily split this cohort into "discovery" and "derivation + validation" subsets in order to study whether a height PRS is associated with CVD risk, correct?

Below is an introductory paragraph that I am writing for teaching material. I would deeply appreciate it if someone could give feedback on it, to make sure that I have not said anything completely wrong.

Usually, single nucleotide variants (SNVs) are first “discovered” through a GWAS, together with each SNV’s summary statistics, including the reference allele, effect size, and P-value. Using these discovered SNVs as a reference, a PRS is “derived” and “validated” by varying the selection criteria (for example, clumping SNVs by linkage disequilibrium (LD), or applying beta shrinkage methods). Finally, the PRS is “tested” for its association with traits or disease outcomes. When a PRS needs to be evaluated for its validity and predictive power on the same phenotype, two distinct samples are needed, one for “derivation” and one for “validation”. However, when a PRS is used to study how one phenotype relates genetically to another, “derivation” and “validation” could be performed on the same cohort. Given the large sample sizes, deep genotyping, and rich phenotyping available in large biobanks such as UK Biobank (UKB), we hypothesize that a PRS-to-Prediction process could be conducted within a single large biobank alone.

Thank you very much & best regards, Jie

privefl / bigsnpr, issue #128