privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

Running LDpred2 automatic model with provided LD reference #193

Closed JingZhang1227 closed 3 years ago

JingZhang1227 commented 3 years ago

Hi Florian,

I am using LDpred2 to calculate polygenic risk scores. I am running the automatic model with the provided LD reference based on the example-with-provided-ldref.R.

I have three datasets: the summary statistics, the provided LD reference, and a test dataset (plink format). The positions of both the summary statistics and the test dataset are hg38.

I was wondering if I could ask a few questions about the example-with-provided-ldref.R.

  1. snp_match (line 21 of the example-with-provided-ldref.R ). Here I matched the SNPs between the summary statistics and the provided LD reference. I used the column “pos_hg38” in the provided LD reference to match the position, but I have question about a0 and a1. In the summary statistics, instead of a0 and a1, it has A1 and A2. From the description of the summary statistics, A1 is the reference allele for OR (may or may not be minor allele), and A2 is the alternative allele. It also says the OR is defined as odds ratio for the effect of the A1 allele. I used A1 as a1, and A2 as a0. I was wondering if you would agree. I also calculated the log(OR) and used it as “beta”.
  2. in_test (line 47). I brought in my test dataset at this step by using snp_readBed() and then obj.bigSNP <- snp_attach(). According to the line 47, I should only match the test dataset to df_beta by “chr” and “pos”. But I was wondering if the alleles should also be matched here. After running the obj.bigSNP <- snp_attach(), I can see the alleles in the test dataset are called allele1 and allele2. I was wondering which allele should be a0.
  3. pred_auto (line 89). My understanding is that here I should use “beta_auto” to get polygenic risk scores for individuals in my test dataset. I don’t think I have access to the “UKBB_imp_HM3.rds”, but I used genotypes from my test dataset as G, and ran the following code:

ind.test <- 1:nrow(G) pred_auto <- big_prodMat(G, beta_auto,ind.row = ind.test, ind.col = df_beta[["_NUM_ID_"]], ncores = NCORES)

I assume the UKBB_imp_HM3.rds is used to calculate the provided LD reference. Because I used the G from my test dataset and it is not matched to df_beta by a0 and a1 (only by chr and pos), I was wondering if it will cause problem here.

I really appreciate any suggestions!

privefl commented 3 years ago
  1. For the alleles it really depends, sometimes you don't know which one is which. What you need to match is between your sumstats and your test data (the correlation matrix is invariant if you switch these two), so it is your job to make sure these are the same. Basically, if you get polygenic scores with a negative predictive value (cor < 0 or AUC < 0.5) for all, then it is a good sign that should switch the alleles.

  2. if you want to also match by the alleles, you can probably use snp_match() in this step.

  3. yes you should use your test set as G, and can use it to perform the QC on the scales (as you do not use the phenotype in the QC).

JingZhang1227 commented 3 years ago

Thank you very much for getting back to me! I will match the alleles in my summary statistics and test dataset carefully. Sorry I didn't fully understand the QC part of the answer (in the third point). Would you mind elaborate a bit further on this?

Thanks again!

privefl commented 3 years ago

I was talking about https://github.com/privefl/paper-ldpred2/blob/master/code/example-with-provided-ldref.R#L106-L109.

JingZhang1227 commented 3 years ago

Got it! Thank you very much!