privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

Predicting phenotype after running LDpred2 #497

Closed shachideshpande closed 1 month ago

shachideshpande commented 1 month ago

Hi,

I have the following setup to compute PRS

Now, my question is: The PRS prediction on the test dataset obtained using the big_prodMat function gives me r^2 values that are close to the h2 parameter that I used in the simulation. However, when I compare the PRS prediction with the true phenotype, I see that the predicted PRS is hugely different in magnitude from the true phenotypic value.

For example, this is what I get

True phenotype Y [initial 10 values] 1.82708859 -1.51552560 -0.03102013 -0.90431992 0.42684915 2.31507040 -1.97493223 0.57433565 0.08778609 1.19800993

Predicted Y = big_prodMat(genotype_matrix, best_beta, ind.col=test_indices) [initial 10 values] 106.91172 23.75843 13.39448 39.01416 -12.47868 29.39887 -10.73250 95.53899 -42.83390 100.86083

Are we supposed to compute a coefficient of scaling and add the intercept obtained from lm(true Y ~ predicted Y) when we compute the PRS based outcome? None of the PRS tutorials seem to show that we scale and translate the predicted PRS outcome using lm(true Y ~ predicted Y). However, if we don't perform the scaling, it is hard for me to compute well-calibrated PRS uncertainty envelopes using the return_sampling_betas=True argumenet in snp_ldpred2_grid function.

privefl commented 1 month ago

Yes, there is no guarantee that the PGS is calibrated. Though, I would not expect such a large difference in magnitude, and usually would expect the PGS to have less magnitude than the phenotype..

What is optimal h2/p you used? How do they compare to the simulated values?

You can definitely use the validation set to calibrate the PGS.

shachideshpande commented 1 month ago

Thank you so much for the response!

Yes, there is no guarantee that the PGS is calibrated. However, I would not expect such a large difference in magnitude, and usually would expect the PGS to have less magnitude than the phenotype..

That makes sense. I wonder if my betas from the summary statistics are off. I am using the following code structure to compute the betas. simu <- snp_simuPheno(G_imputed, h2, M, K=NULL, ncores = NCORES) gwas <- big_univLinReg(G_imputed, simu$pheno[train_indices], ind.train=train_indices, ncores = NCORES)

What is optimal h2/p you used? How do they compare to the simulated values?

An example setup is the following:

Genotype matrix: Chr22, parameters h2=0.25, p=0.01. The h2/p values selected using the validation dataset are 0.137 and 0.56 respectively. This happens as snp_ldpred2_grid returns NA values for almost 80-90% of values on the (h2/p) grid I provide (including h2/p values close to true values in simulation). Also, when I try to run snp_ldpred2_auto to select the right h2/p values, most of the returned h2_est/p_est values are NA.

Here, the r^2 of true_phenotype with product(genotype, true_betas) is 0.25 (as expected) and lm(true_phenotype ~ product(genotype, true_beta)) gives me a slope of 0.999, intercept of -0.6773.

Now, the r^2 of true_phenotype with product(genotype, summary_stats_betas) is 0.09 and lm(true_phenotype ~ product(genotype, summary_stats_beta)) gives me a slope of 0.0066, intercept of -0.144. This corresponds to the predicted Y values I noted in my original question (sorry for the confusion, I meant summary_stats_beta in my original question).

Finally, the r^2 of true_phenotype with product(genotype, ldpred2_betas) is 0.18 and lm(true_phenotype ~ product(genotype, ldpred2_beta)) gives me a slope of 0.67, intercept of 0.44.

So my question is: should I scale my prs prediction and output 0.67*product(genotype, ldpred2_betas)+0.44 so that my uncertainty envelope is well-calibrated? Otherwise, do you think that NA values in selecting the h2/p parameters is suspicious?

I have another additional question on computing the LD matrices for LDpred2.

I use the following code structure to compute LD from an independent set of 2000 individuals in the UKBB dataset (independent of the GWAS training/validation/test dataset). CHR_LIST <- as.integer(G$map$chromosome) POS <- G$map$physical.pos

POS_GEN <- snp_asGeneticPos(CHR_LIST, POS, dir = "filepath", ncores = NCORES)

corr0 <- snp_cor(G_imputed, size = 3 / 1000, infos.pos = POS_GEN, ind.row = ld_indices[1:2000], ncores = NCORES) ld <- Matrix::colSums(corr0^2) corr <- as_SFBM(corr0, backingfile = sfbm_filepath, compact = TRUE)

This code seems to run successfully for chromosomes 13-22, but throws the following error for Chr1-12 when evaluating the last line that casts corr0 to SFBM.

`In (function (Gna, ind.row = rows_along(Gna), ind.col = cols_along(Gna), : integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

caught segfault address 0x7f5d28000000, cause 'memory not mapped'

Traceback: 1: (if (symmetric) range_col_sym else range_col)(spmat@p, spmat@i) 2: .Object$initialize(...) 3: initialize(value, ...) 4: initialize(value, ...) 5: methods::new("SFBM_compact", spmat = spmat, backingfile = backingfile) 6: as_SFBM(corr0, backingfile = sfbm_file, compact = TRUE) An irrecoverable exception occurred. R is aborting now ... /var/spool/slurmd/job3654315/slurm_script: line 13: 1083967 Segmentation fault (core dumped) `

privefl commented 1 month ago

I see, I guess the cause of both of your problems is that you're using way too many variants. For now, this is not recommended in LDpred2. Please restrict to either HapMap3 (or HapMap3+ if you have a good power in your GWAS). Also, you can use the pre-computed LD matrices for these two sets of variants, which are better than those you would compute by yourself, because it also adds some blocks which are good for the convergence of LDpred2.

Have a look at some previous reported issues here: https://github.com/privefl/bigsnpr/issues?q=is%3Aissue+integer+overflow+in+%27cumsum%27 for snp_cor() with too many variants.

shachideshpande commented 1 month ago

I understand, thank you! I have downloaded the LD matrices from the above link and will see how this works for me.

In your LDpred2 paper, did you also restrict your causal indices in simulations to HM3+ variants? My QC-ed genotype data contains about 6 million variants and I am concerned that my final r^2 value might degrade if I only use 25% of these SNPs to run LDpred2. It could work if I restrict the causal indices in my simulations to the ones available on HM3+. But would this be a problem when we run experiments with real phenotypes?

To restrict myself to SNPs from HM3+, I could use PLINK to create a subset of SNP positions common between my UKBB dataset and HM3+. I will try this and see if the speed and quality of results improve.

privefl commented 1 month ago
shachideshpande commented 1 month ago

I tried using the HM3+ subset of my variants and could run the snp_ldpred2_grid as well as snp_ldpred2_auto functions without too many NA outcomes. However, the h2_est and p_est values in the auto outcome are not always close to the true h2/p values. Is this normal?

lm(true_phenotype ~ product(genotype, predicted_betas)) now gives me a slope between 0.5 to 1.5, and it gets closer to 1 when my h2 increases + p decreases.

Also, when I compute the final PRS outcome using big_prodMat() function on the imputed genotype matrix, I get out-of-memory error on the cluster and the R code exits. So I tried to divide the product computation as follows:

fast_prodMat <- function(G_imputed, betas, ind.row, ind.col, ncores, df_beta, block_num=500) { # we use ind.col indirectly

        gv_samples = matrix(0, length(ind.row), ncol(betas))
        total_length = nrow(df_beta)
        block_size = as.integer(total_length/block_num)
        for(chr_block in 0:block_num) {
            print(sprintf("running for %s", chr_block))
            #col_subset = df_beta[df_beta$chr==chr_block, ][["_NUM_ID_"]]
            df_subset = df_beta %>% slice(chr_block*block_size+1:min((chr_block+1)*block_size, total_length))
            prodmat = big_prodMat(G_imputed, betas[df_subset[["_NUM_ID_"]], ], ind.row = ind.row, ind.col = df_subset[["_NUM_ID_"]], ncores=ncores)
            gv_samples = gv_samples + prodmat
        }
        return(gv_samples)
}

However, the computation is incredibly slow for me (each of the 500 steps to compute the product takes more than a minute with 24 cores). I also tried the computation with ncores=1 (to avoid parallelization overhead), but it still slows down after initial 25 steps. Is there something wrong with the way I was computing big_prodMat? This problem did not occur when I restricted myself to just Chr22.

privefl commented 1 month ago
shachideshpande commented 1 month ago

Thank you for sharing the link to the paper!

Sounds good about starting a new issue in the bigstatr repo.

About h2 and p estimates: that makes sense!

In the simulation with h2=0.8 and p=0.001, I tried setting the h2_init value to the true value (0.8) in snp_ldpred2_auto, however all chains converge to h2_est~0.2. I am producing a table of true h2/p values and estimated h2/p values here (sorry about the one missing row).

image

I call the function as follows snp_ldpred2_auto(corr, df_beta, h2_init=ldsc_h2_est, ncores = NCORES, burn_in=100,num_iter = 500, ind.corr=corr_indices, vec_p_init = seq_log(1e-4, 0.99, length.out = 50), allow_jump_sign = FALSE, shrink_corr = 0.95, report_step = 20, sparse=TRUE)

Here, corr is computed in-sample (this produced almost no NA values when passed to snp_ldpred2_auto), corr_indices are all indices within corr, df_beta contains marginal betas/std errors as computed using big_univLinReg (with n_eff=total number of individuals in training set).

Additionally, I tried supplying the true h2/p values and h2_est/p_est values (from snp_ldpred2_auto) to the snp_ldpred2_grid function for a sanity check. Interestingly, the betas produced by the snp_ldpred2_grid function blow up to a very high magnitude on certain columns when I do this (order or magnitude 10^6). How could this happen?

privefl commented 1 month ago

You cannot have missing values (or NaNs) in the correlation matrix.

privefl commented 1 month ago

I should make it error in this case I guess (cf. https://github.com/privefl/bigsnpr/issues/498).

shachideshpande commented 1 month ago

Thanks!

I believe that the correlation matrix does not contain NAs. In the question, when I wrote "this produced almost no NA values when passed to snp_ldpred2_auto", I meant that the output of snp_ldpred2_auto did not produce many NA-valued chains (sorry if this was confusing!). The ld vector that is computed from the correlation matrix (by squaring and summing across each column of the correlation matrix) does not contain any NA values (hence, the correlation matrix cannot contain NAs).

I wonder if there is a different issue with the correlation matrix. For example, I have not used the INFO scores on imputation quality of SNPs to perform QC.

Also, when I tried the pre-computed correlation matrices from the link you provided, the snp_ldpred2_auto function produced more NA-valued chains.

privefl commented 1 month ago

If you're using some publicly available GWAS summary statistics, please share the code you've tried using the pre-computed LD matrices, and I can test it.

shachideshpande commented 1 month ago

Thank you so much! Since I am working with simulations at the moment, I am simply using the big_univLinReg function to compute summary statistics.

Following is the complete source code I am using. I am attaching an example file joint_ids.txt (in the following comment) which contains the join of HM3 variant IDs with variant IDs in my genotype matrix (which would be needed when running the snpldpred2* functions). The sequence of HM3 indices present in the genotype matrix would be different for you. For this code to work, the precomputed LD block matrices would need to be present in the ldref folder, and the HM3 subset of UKBB variants should be in ukbb_hm3.bed.

library(bigsnpr)
library(dplyr)
# set simulation parameters
h2=0.8
p=0.001
# calculate LD in-sample?
insample=as.logical("FALSE")

NCORES=nb_cores()

# read bed file
if(!file.exists("ukbb_hm3.rds")) {
        G = snp_readBed2("ukbb_hm3.bed", ncores=NCORES)
}
G_attach = snp_attach("ukbb_hm3.rds")

# set number of causal SNPs
M = as.integer(p*ncol(G_attach$genotypes))

# impute the SNPs
if(!file.exists("ukbb_hm3_imputed.rds")) {
        G_imputed   <-  snp_fastImputeSimple(G_attach$genotypes, method='mean2', ncores = NCORES)
        saveRDS(G_imputed, "ukbb_hm3_imputed.rds")
} else {
       G_imputed <- readRDS("ukbb_hm3_imputed.rds")
}

# joint_ids.txt contains the join result of HM3 variant IDs with variant IDs in my genotype matrix
hm3_merge_file = "joint_ids.txt"
hm3_merge = read.table(hm3_merge_file, header=T, sep="\t")
hm3_small_seq = hm3_merge$hm3_seq # sequence of HM3 indices present in my genotype matrix

# fixing train/test/val/extra set of indices to be used
set.seed(1) # for generating train-val-test-dev split
train_indices = sort(sample(nrow(G_imputed), nrow(G_imputed)*.7))
rem_indices = sort(setdiff(rows_along(G_imputed), train_indices))
set.seed(1) # for generating train-val-test-dev split
val_indices = sort(sample(rem_indices, nrow(G_imputed)*0.1))
rem_indices = sort(setdiff(rem_indices, val_indices))
set.seed(1) # for generating train-val-test-dev split
test_indices = sort(sample(rem_indices, nrow(G_imputed)*0.1))
rem_indices = sort(setdiff(rem_indices, test_indices))

iter = 1 # first replicate of experiments
# simulate phenotype with given h2 and p 
pheno_file <- sprintf("ChrHM3All_small_%d_%d_%d_%d.rds", 100 * h2, M, 100 * 1, iter)
        if (!file.exists(pheno_file)) {
                simu <- snp_simuPheno(G_imputed, h2, M, K=NULL,  ncores = NCORES)
                saveRDS(simu, pheno_file)
                write.table(simu$pheno, sep=",", paste(pheno_file, ".csv", sep = ""), col.names=FALSE, row.names=FALSE)
  }
# perform GWAS 

gwas_file <- sprintf("ChrHM3All_small_gwas_%d_%d_%d_%d.rds", 100 * h2, M, 100 * 1, iter)
        pheno_file <- sprintf("ChrHM3All_small_%d_%d_%d_%d.rds.csv",100 * h2, M, 100 * 1, iter)
        if (!file.exists(gwas_file)) {

        y <- read.table(pheno_file, header=FALSE, sep=",")
        gwas <- big_univLinReg(G_imputed, y$V1[train_indices], ind.train=train_indices, ncores = NCORES)
        print("performed gwas")
        saveRDS(gwas, gwas_file)
        }
# generate summary statistics from above result
simu_pheno_output = (readRDS(pheno_file))
y= simu_pheno_output$pheno
set = simu_pheno_output$set
effects = simu_pheno_output$effects
allele_effects = simu_pheno_output$allelic_effects
sumstats_file = gwas_file
sumstats <- readRDS(sumstats_file)
[joint_ids.txt.zip](https://github.com/privefl/bigsnpr/files/15424698/joint_ids.txt.zip)

sumstats = (sumstats
          %>% mutate(n_eff = length(hm3_small_seq)))
sumstats$a0 = G_attach$map$allele2
sumstats$a1 = G_attach$map$allele1
sumstats$beta = sumstats$estim
sumstats$beta_se = sumstats$std.err
sumstats$chr = G_attach$map$chromosome
sumstats$rsid = G_attach$map$marker.ID

# set the right ld/corr matrices

# precomputed corr matrices

for (chr_num in 1:22) {

        hm3_ld_block_file = "ldref/LD_with_blocks_chr%d.rds"
        corr0_hm3_small = readRDS(sprintf(hm3_ld_block_file, chr_num))
        if (chr_num == 1) {
                tmp_sfbm_file = sprintf("chr_hm3_block_%d_%d_%d_ukbbsmall", chr_num, 100*h2, M)
                if(file.exists(paste(tmp_sfbm_file, ".sbk", sep="")))
                {
                        unlink(paste(tmp_sfbm_file, ".sbk", sep=""))
                }

                ld_hm3_small <- Matrix::colSums(corr0_hm3_small^2)
                corr_hm3_small <- as_SFBM(corr0_hm3_small, tmp_sfbm_file)
        } else {
        ld_hm3_small <- c(ld_hm3_small, Matrix::colSums(corr0_hm3_small^2))
        corr_hm3_small$add_columns(corr0_hm3_small, nrow(corr_hm3_small))
        }

}

ld = ld_hm3_small
corr = corr_hm3_small
corr_indices = hm3_small_seq

map <- setNames(G_attach$map[-3], c("chr", "rsid", "pos", "a1", "a0")) 
df_beta <- snp_match(sumstats, map, join_by_pos = FALSE, strand_flip = FALSE, remove_dups = FALSE) # simplest setting retaining everything for the simulated setup

# produces a much smaller estimate of h2 as compared to the true value
ldsc <- snp_ldsc(ld[corr_indices], length(ld[corr_indices]), chi2 = (df_beta$beta / df_beta$beta_se)^2, sample_size = df_beta$n_eff,  blocks = NULL, ncores = NCORES)
ldsc_h2_est <- ldsc[["h2"]]

effect_scales <- with(df_beta, sqrt(n_eff * beta_se^2 + beta^2))

# now run snp_ldpred2_auto
if(!file.exists(sprintf("sample_beta_small%d_%d_%d_%d.rds", 100*h2, M, iter, as.integer(insample)))) {

        multi_auto = snp_ldpred2_auto(corr, df_beta, h2_init=ldsc_h2_est, ncores = NCORES, burn_in=100,num_iter = 500, ind.corr=corr_indices, vec_p_init = seq_log(1e-4, 0.99, length.out = 50), allow_jump_sign = FALSE, shrink_corr = 0.95, report_step = 1, sparse=TRUE)

        # the below line gives me all NA values when I use precomputed LD matrix. It gives valid answers with insample LD matrix, but the h2 estimates are 1/4th the true value and p estimates are close to correct values only when true p < 0.01
        print(str(multi_auto[[1]], max.level = 1))
        (range <- sapply(multi_auto, function(auto) diff(range(auto$corr_est))))
        print(range)
        (keep <- which(range > (0.95 * quantile(range, 0.95, na.rm = TRUE))))
        beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
        pred_auto <- big_prodVec(G_imputed, beta_auto, ind.row = test_indices[1:1000], ind.col = df_beta[["_NUM_ID_"]], ncores=NCORES)
        print(pcor(pred_auto, y[test_indices[1:1000]], NULL))
        print(lm(y[test_indices[1:1000]] ~ pred_auto))

        for (chain_i in 1:length(multi_auto)){
                multi_auto[[chain_i]]$sample_beta <- sweep(multi_auto[[chain_i]]$sample_beta, 1, effect_scales, '*')
        }
        # chain quality control
        all_h2 <- sapply(multi_auto, function(auto) auto$h2_est)
        auto_h2 <- median(all_h2)
        keep <- between(all_h2, 0.7 * auto_h2, 1.4 *auto_h2)
        all_p <- sapply(multi_auto, function(auto) auto$p_est)
        auto_p <- median(all_p[keep])
        keep <- keep & between(all_p, 0.5 *auto_p, 2 * auto_p)

        # model info
        model <- list(
        h2_keep = all_h2[keep],
        p_keep = all_p[keep],
        h2_est = ldsc_h2_est,
        df_beta = df_beta,
        keep = keep,
        scale = effect_scales,
        pred_auto = pred_auto,
        multi_auto = multi_auto
        )

        saveRDS(model, sample_beta_small%d_%d_%d_%d.rds", 100*h2, M, iter, as.integer(insample)))
}

model = readRDS(sprintf("sample_beta_small%d_%d_%d_%d.rds", 100*h2, M, iter, as.integer(insample)))
single_auto = model$multi_auto
# compute prs predictions (using a small subset of test data as the below functions take time to run)
pred_mat = matrix(0, 2000, 500)

# The values below are not always close to true h2/p!
print("Printing estimated h2/p")
print(c(single_auto[[1]]$h2_est, single_auto[[1]]$p_est))
# This was faster for me instead of computing for 500 samples at once
for(sample_i in 1:500) {
        pred_mat[, sample_i] <- big_prodVec(G_imputed, single_auto[[1]]$sample_beta[, sample_i], ind.row = test_indices[1:2000], ind.col = df_beta[["_NUM_ID_"]], ncores=NCORES)

}

# slope needs to be close to 1, correlations needs to be close to sqrt(h2)
print("Printing all slope/intercepts/cor - sequence[ldpred2] test")

print(lm(y[test_indices[1:1000]] ~ rowMeans(pred_mat)[1:1000]))
print(cor(y[test_indices[1:1000]], rowMeans(pred_mat)[1:1000]))
shachideshpande commented 1 month ago

The upload did not work earlier, but here is joint_ids.txt.zip

privefl commented 1 month ago

After a quick look at your code, I can see that this is probably the error: n_eff = length(hm3_small_seq); this is not the size of the training set at all.

PS: You could use the BGEN dosage data instead of bed + mean imputation.

shachideshpande commented 1 month ago

Thank you so much! Indeed, that's a good catch! I changed this to the size of training data and I see that the h2 estimates are now more sensible (0.85 when true h2 was 0.8). However, when I submit this script (with no other change), I get a new issue upon calling snp_ldpred2_auto across all submitted jobs (i.e. with varying h2 and p). This was not present with the earlier version and I haven't updated anything within my R environment.

Error in makePSOCKcluster(names = spec, ...) :
  Cluster setup failed. 29 of 29 workers failed to connect.
Calls: snp_ldpred2_auto -> <Anonymous> -> <Anonymous> -> makePSOCKcluster
Execution halted

When I search this error type, I see that this is a suggested solution cl <- parallel::makeCluster(2, setup_strategy = "sequential") https://stackoverflow.com/questions/62730783/error-in-makepsockclusternames-spec-cluster-setup-failed-3-of-3-work?answertab=votes#tab-top

privefl commented 1 month ago

Please open a new issue with this.

shachideshpande commented 1 month ago

Sounds good, thank you so much for being so prompt and helpful!