privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

5 questions to LDpred2 #106

Closed jielab closed 4 years ago

jielab commented 4 years ago

Hi there,

I have a few theoretical and technical questions, listed below. It would be great if you could shed some light on them.

  1. Figure 1 of Sekar's 2018 paper https://www.nature.com/articles/s41588-018-0183-z shows derivation, validation, and testing. A long time ago, I read Shaun Purcell's 2009 Nature paper https://www.nature.com/articles/nature08185. That paper used a discovery sample and a target sample. I now know that there is a validation step. I guess this is where the concepts "summary statistics" vs. "individual data" come from, and where my confusion arose. I recently also read a similar PRS flowchart at https://www.nature.com/articles/s41581-018-0067-6/figures/1. Somehow, this figure has an extra step, "optimization". Your paper also uses the term "optimize" in "the value that optimizes prediction accuracy can then be determined in an independent validation dataset." So, which step does this "optimize" belong to, and what exactly does it do?

  2. So, you see, all these terms could cause confusion, especially when there are more of them, such as "training", "another independent validation", "base data", "test data", "testing data", etc. For your LDpred2 paper, it would be really helpful to add a Box explaining all these terms, so that users don't get confused.

  3. For the LDpred2 genotype file and summary statistics file that I downloaded from https://privefl.github.io/bigsnpr/articles/LDpred2.html, I found that both the genotype map file and the sumstats file have 130,816 records. I thought that the genotype file was from 1000G, which usually has ~500 samples for LD reference purposes. So, is the genotype file for LDpred2 actually the "individual-level" genotype data that was used to generate the summary statistics?

  4. The LDpred2 bioRxiv paper says that "LDpred2-auto performs particularly well, provided some quality control is performed on the summary statistics", but in Figure 1, "ldpred2_auto" does not really perform that well for the HLA data. Will this be improved, so that the "auto" version always performs best for users?

  5. Although this is named LDpred2, the way to run it is, from a user's perspective, completely different from running LDpred. So, will you guys stick to the bigsnpr approach, or might you one day enable Linux commands such as "ldpred2 coord"? Also, how about LDpred-funct? It would be a lot of work for users to run three different LDpred programs, and there are also other programs, such as SBayesR and lassosum, to evaluate.

Your feedback to my above questions would be really appreciated!

Thank you & best regards, Jie

privefl commented 4 years ago
  1. I think discovery/derivation would correspond to performing the GWAS to get summary statistics. Usually, you get these from external published studies for which you don't have access to individual-level data. Optimization/validation is the step where you choose the best-performing hyper-parameters of the method. Then you test the chosen model in an independent set. Validation and testing require individual-level data (see the R sketch after this list).

  2. These terms are pretty common. It is as if you asked me to define what a GWAS is whenever I use summary statistics. We can't re-explain everything all the time. Do you have a supervisor to whom you can ask these questions?

  3. The 130K records are variants, not individuals. For the tutorial, these are simulated data based on 1000G data. I think I provide the code used to generate this data.

  4. These HLA cases are extremely hard ones, where very large effects lie in the same highly correlated region and no effect at all is present on any chromosome other than chr6. This is an extreme case meant to represent autoimmune diseases, such as T1D. But actually, T1D seems easier, and LDpred2 works well for T1D.

  5. There won't be any Python version of LDpred2, I believe. We are not the developers of LDpred-funct, so you should ask its author about this. Yes, there are lots of methods out there, and it takes a lot of time to try them all.
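To make the "optimization/validation" step in answer 1 concrete, here is a minimal R sketch in the spirit of the LDpred2 tutorial. It assumes you already have `G` (the genotype FBM, whose columns match the variants in the sumstats), `corr` (the sparse LD matrix), and `df_beta` (the GWAS summary statistics); the phenotype vector `y` and the index vectors `ind.val`/`ind.test` are illustrative names, and `h2 = 0.3` is a placeholder for an actual heritability estimate.

```r
library(bigsnpr)

# Grid of hyper-parameters to try (p = proportion of causal variants).
params <- expand.grid(p = signif(seq_log(1e-4, 1, length.out = 10), 2),
                      h2 = 0.3, sparse = FALSE)

# One vector of adjusted effect sizes per hyper-parameter combination.
beta_grid <- snp_ldpred2_grid(corr, df_beta, params)

# One PRS per combination, computed for the validation individuals only.
pred_val <- big_prodMat(G, beta_grid, ind.row = ind.val)

# "Optimization/validation": pick the combination with the best validation score.
scores <- apply(pred_val, 2, function(prs) cor(prs, y[ind.val]))
best <- which.max(scores)

# "Testing": report the accuracy of the single chosen model in the test set.
pred_test <- big_prodMat(G, beta_grid[, best, drop = FALSE], ind.row = ind.test)
cor(pred_test, y[ind.test])
```

Because `best` is chosen to maximize the validation score, that score is optimistically biased; this is exactly why the final number must come from the untouched test set.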

jielab commented 4 years ago

Dear Florian:

Thank you very much!

For "We can't reexplain everything all the time. Do you have a supervisor to whom ask these questions", I certainly don't expect you guys to explain everything. I am actually a "supervisor" myself, and I am trying to teach my graduate students what LDpred2 is and how to run it.

I simply hope that LDpred2 is not only better, faster, and stronger, but also clearer in how it explains things.

For example, in your previous response, you used terms such as "discovery/derivation" and "optimization/validation". Do you mean that "optimization" is the same as "validation"? If I generate 10 PRS and pick the best one based on the highest R2 in a validation dataset, I feel we might find a different best R2 if we used another validation dataset. So, is this process done manually, and how many rounds of "optimization/validation" do we need before we can say "that is enough"? Also, at which step does the "summary statistics" vs. "individual-level data" distinction come in, since all programs need GWAS summary statistics for "discovery/derivation" and individual-level data for "testing"? I feel these are valid and reasonable questions.

I am simply hoping that you guys could make and share a PRS architecture plot like the following one, but with a bit more detail to explain exactly what is done at each step. The following plot is confusing to me too. For example, a "best GPS" is already displayed before the "validation" step, and I am not sure what the last step, "additional predictors", is doing.

[screenshot: PRS flowchart showing the "best GPS" and "additional predictors" steps discussed above]

I hope other users and potential users of LDpred2 would also like a plot like the one I just mentioned, so that we know exactly what LDpred2 is, rather than just a few plots comparing its performance with other tools to show that LDpred2 performs better.

Thank you & best regards, Jie

privefl commented 4 years ago

The standard design of most machine-learning algorithms is: fit the model on a training set, tune its hyper-parameters on a validation set, and evaluate the final chosen model on a held-out test set.

Normally, ML methods use individual-level data (matrices of individuals × variables); PRS methods based on summary statistics are just a bit different. But not all PRS methods are based on summary statistics; see e.g. my penalized regressions implementation, which is just another ML method based on individual-level data (https://www.genetics.org/content/212/1/65).

Sumstats-based methods are different in the sense that they decouple learning the effects from handling the correlation between variables. In these cases, the (independently estimated) effects come from an external GWAS, and then you have to use an LD/correlation matrix to account for the LD between your variants/variables.
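To illustrate the decoupling, here is a minimal sketch in the spirit of the LDpred2 tutorial. It assumes `G` is the genotype FBM of an LD reference panel, `POS2` holds genetic positions in cM (e.g. from `snp_asGeneticPos()`), and `ind.chr` indexes the variants of one chromosome; these names are illustrative.

```r
library(bigsnpr)

# Ingredient 1: independently estimated effects from an external GWAS.
# `df_beta` has columns beta, beta_se, and n_eff -- no individual-level
# data from the GWAS is ever needed.

# Ingredient 2: the correlation (LD) between variants, computed from the
# reference panel only, within 3-cM windows.
corr0 <- snp_cor(G, ind.col = ind.chr, size = 3 / 1000,
                 infos.pos = POS2[ind.chr])
corr <- as_SFBM(corr0)  # sparse on-disk format expected by LDpred2

# A sumstats-based method then combines only these two ingredients, e.g.:
# snp_ldpred2_auto(corr, df_beta, h2_init = h2_est)
```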

jielab commented 4 years ago

I just found the following figure, from the 2020 Circulation Research paper "Polygenic Scores to Assess Atherosclerotic Cardiovascular Disease Risk". It helped me a lot in understanding the terminologies and processes.

[Figure 2 from the Circulation Research paper]

privefl commented 4 years ago

Good. Be aware that what is called "Training" in this figure is sometimes called "Validation" elsewhere, and what is called "Validation" here is then called "Testing".

jielab commented 4 years ago

Exactly, and that is where the confusion arises.

So, both LDpred and LDpred2 are in the category of "beta shrinkage", in contrast to LD clumping?

LDpred is also a Bayesian method. As written in the original LDpred paper, "we propose LDpred, a Bayesian PRS that estimates posterior mean causal effect sizes from GWAS summary statistics by assuming a prior for the genetic architecture and LD information from a reference panel." Bayesian is usually contrasted with frequentist; I don't know which "beta shrinkage" methods are frequentist instead of Bayesian. The most difficult part of Bayesian methods is the prior. I don't know if it can be easily explained how LDpred "assumes a prior for the genetic architecture and LD information from a reference panel".
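For reference, my understanding is that the prior in the original LDpred paper is a point-normal (spike-and-slab) mixture on each variant's effect:

$$
\beta_j \sim
\begin{cases}
\mathcal{N}\!\left(0, \dfrac{h^2}{M p}\right) & \text{with probability } p, \\
0 & \text{with probability } 1 - p,
\end{cases}
$$

where $M$ is the number of variants, $h^2$ the SNP heritability, and $p$ the proportion of causal variants. LDpred then reports the posterior mean of each $\beta_j$ given the GWAS marginal effects and the LD matrix from the reference panel.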

Also, the second sentence of the LDpred2 bioRxiv paper says that "LDpred is a popular and powerful method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants". I thought that all other PRS methods were also "based on summary statistics and a matrix of correlation between genetic variants", correct? This is my other confusion: "summary statistics" vs. "individual-level data".

It would be great if you guys were willing to provide a plot like the one I showed, to clarify all these key terminologies and differences. As you can see, the two plots I pasted in my previous posts are already inconsistent in their terminology. I understand that a new piece of software is mainly about functionality and performance, but I feel it is also very important for users to fully understand the core differences and the flowchart of LDpred2.

Best regards, Jie

privefl commented 4 years ago

[links to the polygenic-score tutorial paper published in Nature Protocols]

jielab commented 4 years ago

Dear Florian:

Thank you very much!

The Nature Protocols paper is a great one for me to read :-)

Best regards, Jie

privefl commented 3 years ago

Thank you for using LDpred2. Please note that we now recommend running LDpred2 genome-wide instead of per chromosome. The paper (preprint) and tutorial have been updated.
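For anyone finding this later, here is a minimal sketch of the genome-wide pattern from the updated tutorial: build a single sparse LD matrix across all chromosomes and pass it to LDpred2 once. It assumes `G` (the genotype FBM), `map` (variant info with a `chr` column), and `POS2` (genetic positions in cM); these names follow the tutorial, but treat the snippet as a sketch rather than a drop-in script.

```r
library(bigsnpr)

tmp <- tempfile()  # backing file for the on-disk sparse matrix

for (chr in 1:22) {
  ind.chr <- which(map$chr == chr)

  # LD within this chromosome only, in 3-cM windows.
  corr0 <- snp_cor(G, ind.col = ind.chr, size = 3 / 1000,
                   infos.pos = POS2[ind.chr])

  if (chr == 1) {
    corr <- as_SFBM(corr0, tmp)          # create the genome-wide matrix
  } else {
    corr$add_columns(corr0, nrow(corr))  # append this chromosome's block
  }
}

# `corr` is now passed once, genome-wide, to e.g. snp_ldpred2_auto(corr, df_beta, ...)
```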