privefl / paper-ldpred2

Paper discribing LDpred2
https://doi.org/10.1093/bioinformatics/btaa1029
16 stars 12 forks source link

Correlation ldref example vs main tutorial/vignette #10

Closed kathrynfreeman closed 1 year ago

kathrynfreeman commented 2 years ago

Hi Dr. Privé,

I just wanted to make sure the correlation here from the example with provided ldref:

tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  cat(chr, ".. ", sep = "")

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'map_ldref'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ## indices in 'corr_chr'
  ind.chr3 <- match(ind.chr2, which(map_ldref$chr == chr))

  corr_chr <- readRDS(paste0("ldref/LD_with_blocks_chr", chr, ".rds"))[ind.chr3, ind.chr3]

  if (chr == 1) {
    corr <- as_SFBM(corr_chr, tmp)
  } else {
    corr$add_columns(corr_chr, nrow(corr))
  }
}

can be used in place of the below correlation code from the tutorial (when my dataset has less than 2000 individuals) to calculate ldsc and all the proceeding required to get pred_auto.

I use code similar to the ldref example code until the h2_est <- ldsc[["h2"]] step and then compute LDpred2-auto using the code from the polygenic scoring vignette/tutorial.

POS2 <- obj.bigSNP$map$genetic.dist`

tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  # print(chr)

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'G'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]

  corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
                   infos.pos = POS2[ind.chr2], ncores = NCORES)

  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

If a dataset is barely over 2000 individuals is it harmful to still use the provided ldref example code pipeline until the h2_est <- ldsc[["h2"]] step?

Thank you in advance, Kate

privefl commented 2 years ago

The provided LD ref has many advantages, and you should prefer using it.

The main reason why you should not use it is when you do not have imputed data and the overlap between your test data and the LD ref is not great.

The second main reason is when you're building PGS using GWAS from another ancestry.

kathrynfreeman commented 2 years ago

Thank you for your response! I'm so sorry for my confusion.

Once I use the ldref correlation code, can I just continue with the code in the tutorial for LDpred2-auto? It appears to run properly but I want to ensure I am calculating the right thing. Does the code in the example with ldref also deal with summary stats QC or does that need to be done separately?

Additionally, is the number of variants used to calculate the pred score represented by the number of observations within df_beta?

Thank you so much again, Kate

privefl commented 2 years ago

Sumstats QC needs to be performed before.

Yes, all variants you have as input (in df_beta and corr) will be used.