which summary statistics?

jielab commented 4 years ago

Hi, guys:

I have suddenly become confused by the term “summary statistics”.

Let’s say that I want to use the GIANT Height GWAS summary statistics to generate a PRS for all UKB individuals. All UKB individuals have individual-level genotype data. Let’s say that I have also run two GWAS in UKB, say, (1) Height; (2) CVD.

This page https://privefl.github.io/bigsnpr/index.html says that “There are 3 main methods currently available” for PRS:

• Penalized regressions with individual-level data (see paper and tutorial)
• Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual level data (see paper and tutorial).
• LDpred2 with summary statistics (see preprint and tutorial)

I think “summary statistics” in the above means my two UKB GWAS (Height, CVD), not the GIANT GWAS summary statistics, correct? After all, the GIANT GWAS summary statistics files are required by all PRS methods, because that is the reference, correct?

I would be very impressed if LDpred2 could perform a PRS analysis even without UKB individual genetic data. I have been wondering how an R package could deal with the UKB individual genetic data for ~500,000 samples.

But what if I do want to generate a PRS for all ~500,000 individuals in UKB? Can LDpred2 still do that?

Can someone please kindly clarify this?

Thank you very much & best regards, Jie

privefl commented 4 years ago

Summary statistics are the GWAS summary statistics, e.g. beta, beta_se, p_val for all variants. Individual-level data are the genotype matrix where you have the allele counts for all variants for all individuals (e.g. UKBB data). You can always perform a GWAS on the individual-level data you have in order to use methods based on summary statistics.
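
As a rough illustration of that last point, here is a minimal Python sketch (not bigsnpr code) that turns toy individual-level data into per-variant summary statistics with one simple linear regression per variant. The function name `marginal_gwas` and the toy data are made up for this example, the p-value uses a normal approximation, and no covariates are adjusted for, which a real GWAS would normally do.

```python
import math

def marginal_gwas(genotypes, phenotype):
    """Regress the phenotype on each variant separately (simple linear regression).

    genotypes: list of rows, one per individual, each a list of allele counts (0/1/2).
    phenotype: list of floats, one per individual.
    Returns one dict per variant with keys beta, beta_se, p_val.
    Assumes every variant is polymorphic and that the fit is not perfect.
    """
    n = len(phenotype)
    y_mean = sum(phenotype) / n
    sumstats = []
    for j in range(len(genotypes[0])):
        g = [row[j] for row in genotypes]  # allele counts for variant j
        g_mean = sum(g) / n
        sxx = sum((x - g_mean) ** 2 for x in g)
        sxy = sum((x - g_mean) * (y - y_mean) for x, y in zip(g, phenotype))
        beta = sxy / sxx  # marginal effect size
        rss = sum((y - y_mean - beta * (x - g_mean)) ** 2
                  for x, y in zip(g, phenotype))
        beta_se = math.sqrt(rss / (n - 2) / sxx)  # standard error of beta
        z = beta / beta_se
        p_val = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
        sumstats.append({"beta": beta, "beta_se": beta_se, "p_val": p_val})
    return sumstats

# Toy individual-level data (made up): 6 individuals x 2 variants
genotypes = [[0, 1], [1, 1], [2, 0], [0, 2], [1, 0], [2, 2]]
phenotype = [1.0, 2.0, 3.5, 0.5, 2.5, 3.0]
for row in marginal_gwas(genotypes, phenotype):
    print(row)
```

The input is the individual-level data (genotype matrix plus phenotype); the output has the shape of a summary statistics file, one row per variant.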

jielab commented 4 years ago

Thanks, Florian!

Surely I know that summary statistics are beta, beta_se, etc., while individual level data are sample_ID, genotype, phenotype, etc.

For the example that I gave, it seems "summary statistics" means the GWAS summary statistics of the GIANT Height GWAS. The individual-level data are the UKB genotype and phenotype data. It seems to me that all PRS methods would need these two pieces of information, besides the LD matrix that is also needed.

My question is, for the "3 main methods currently available” for PRS, listed here https://privefl.github.io/bigsnpr/index.html, why does the first method say "individual-level data" and the third method say "summary statistics", while the second one says both?

Can the first method run without GIANT GWAS summary statistics, or can the third method run without UKB individual-level data?

Best regards, Jie

privefl commented 4 years ago

1/ The first method is a penalized regression using only the individual-level matrix and phenotype.

2/ SCT derives C+T scores (based on sumstats) and then trains a penalized regression using these scores (indiv).

3/ LDpred2 uses sumstats only (and possibly also indiv data for choosing the best-performing parameters, when not using LDpred2-auto).

I invite you to look at the papers and tutorials to understand better which input data is needed.

jielab commented 4 years ago

Thanks again, Florian!

Let's put aside "penalized regression" for now. I simply want to know: aren't both the GIANT height GWAS summary statistics and the UKB individual genotype data required for me to generate a height PRS for each UKB participant? You just brought in a new term, "individual-level matrix"....

Yes, I did intend to learn all these things by running the commands of each of these tools (PRSice, LDpred, RbayesR, LASSOSUM). Unfortunately, they all require different software, packages and dependencies, and sometimes I could not even install them all and run the example.

It would be nice if you guys could provide a flowchart for people like me who really want to know the basic principle of how it works without digging into the complicated stats code or computer code.

Best regards, Jie

bvilhjal commented 4 years ago

Hi Jie.

The data referred to in the three methods is the type of training data required. You always need individual-level data for the prediction.

Best, Bjarni
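
To make the prediction step concrete, here is a minimal Python sketch (not bigsnpr code) of why individual-level genotypes are always needed at that point: each person's score is a weighted sum of their own allele counts. The function name `polygenic_scores` and the toy numbers are made up for this example, and the sketch assumes the weights (e.g. effect sizes produced by whichever training method was used) are already aligned to the same variants and alleles as the genotype columns.

```python
def polygenic_scores(genotypes, weights):
    """Compute one score per individual: PRS_i = sum_j G_ij * w_j.

    genotypes: list of rows, one per target individual, each a list of allele counts.
    weights: list of per-variant weights, in the same variant order as the columns.
    """
    scores = []
    for row in genotypes:
        if len(row) != len(weights):
            raise ValueError("each individual needs one allele count per weighted variant")
        scores.append(sum(g * w for g, w in zip(row, weights)))
    return scores

# Toy example (made-up numbers): 2 target individuals x 3 variants
weights = [0.10, -0.20, 0.05]
target_genotypes = [[0, 1, 2], [2, 0, 1]]
print(polygenic_scores(target_genotypes, weights))
```

Training can rely on summary statistics alone, but this last step cannot: without the target individuals' genotypes there is nothing to multiply the weights by.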

jielab commented 4 years ago

Dear Bjarni:

Thank you very much. I now kind of understand.

All PRS methods need the following two:

(1) GWAS summary statistics from a DISCOVERY cohort, such as GIANT.height.gwas.txt.

(2) Individual-level data from a TARGET cohort, both genotypic (UKB genotype data) and phenotypic (UKB height, or some other disease traits to test the associations). This is called the "prediction phase".

Since the DISCOVERY cohort and the TARGET cohort are somewhat different (although both are of the same ethnicity), in the case that the DISCOVERY cohort individual-level genotype data is not available, an LD matrix such as that from the 1000G project is needed to "prune" or "shrink" the original BETA values in GIANT.height.gwas.txt.

Then in between (1) and (2) mentioned above, there is a "training phase", which aims to find out the best PRS. During this training phase, some methods might need the GIANT "individual level" genotype data, while others might only need the GIANT GWAS "summary statistics". And this is where the different methods differ in terms of data requirement and underlying algorithm.

Can you please kindly let me know whether I have got this right this time?

BTW, now there are LDpred, LDpred-funct, LDpred2. Should I simply use LDpred2 since it is the newest one and it is supposed to perform better than the other two? One issue I have with LDpred2 is that it is written in R. I can't imagine that it could process a PRS based on millions of SNPs.

Best regards, Jie

privefl commented 4 years ago

You should really read the papers.

privefl / bigsnpr

which summary statistics? #100