privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Physical memory error when running LDpred2 over multiple chromosomes #132

Closed GRealesM closed 4 years ago

GRealesM commented 4 years ago

Hi Florian,

I hope you're well. I've been trying to run LDpred2 after implementing the QC steps that you suggested, and I included multiple init_p values so that it fully benefits from parallel computing. Also, I used the 1000G Phase I reference panel following the tutorial, but filtered it to the HapMap3 variants (as at the bottom of the tutorial) instead of thinning; a sketch of that step is below. I tried to compute it on our HPC both interactively and as a slurm job. See below for my sessionInfo() and the function I'm running:
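For context, the HapMap3 filtering step looked roughly like this (a sketch rather than my exact code; the .bed file name and the hm3_rsid vector of HapMap3 rsIDs are placeholders):

library(bigsnpr)
snp_readBed("../references/LDpred2-ref/1000G_phase1.bed")
obj <- snp_attach("../references/LDpred2-ref/1000G_phase1.rds")
# keep only variants whose rsID is in HapMap3
ind.hm3 <- which(obj$map$marker.ID %in% hm3_rsid)
# subset() writes a new filtered backingfile and returns the path to its .rds
rds_hm3 <- subset(obj, ind.col = ind.hm3)
obj.hm3 <- snp_attach(rds_hm3)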

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.8 (Nitrogen)

Matrix products: default
BLAS:   /usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/gcc-5.4.0/r-3.6.1-zrytncqvsnw5h4dl6t6njefj7otl4bg4/rlib/R/lib/libRblas.so
LAPACK: /usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/gcc-5.4.0/r-3.6.1-zrytncqvsnw5h4dl6t6njefj7otl4bg4/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bigsnpr_1.4.11  bigstatsr_1.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         magrittr_1.5       cowplot_1.0.0      tidyselect_1.1.0  
 [5] munsell_0.5.0      doParallel_1.0.15  lattice_0.20-41    colorspace_1.4-1  
 [9] R6_2.4.1           rlang_0.4.7        foreach_1.5.0      dplyr_1.0.0       
[13] parallel_3.6.1     grid_3.6.1         data.table_1.13.0  gtable_0.3.0      
[17] bigparallelr_0.2.3 iterators_1.0.12   ellipsis_0.3.1     tibble_3.0.3      
[21] lifecycle_0.2.0    crayon_1.3.4       Matrix_1.2-18      bigsparser_0.4.0  
[25] bigassertr_0.1.3   purrr_0.3.4        ggplot2_3.3.2      flock_0.7         
[29] vctrs_0.3.2        codetools_0.2-16   glue_1.4.1         compiler_3.6.1    
[33] pillar_1.4.6       generics_0.0.2     scales_1.1.1       pkgconfig_2.0.3

And my function:

library(bigsnpr)
## To make them reproducible!
set.seed(1)

computePGS.ldpred2.auto <- function(sumstats, ancestry = "EUR", iterations = 3000, initial_p = seq_log(1e-4, 0.9, 15)) {

    message("Computing PGS using LDpred2 auto.")
    message("Loading summary statistics dataset: ", sumstats,".")
    NCORES <- nb_cores()
    sumstats <- bigreadr::fread2(sumstats)

    # Check for minimum columns
    mincol <- c("CHR19","BP19","SNPID","REF","ALT","BETA","SE","P","n_eff")
    if (!all(mincol %in% names(sumstats)))
        stop("All minimum columns should be present in the summary statistics dataset. Please check, missing columns: ",
             paste(setdiff(mincol, names(sumstats)), collapse = ", "))
    # We don't need all columns, so we can filter them by name
    sumstats <- sumstats[, mincol]
    # snp_match() requires specific column names, so let's abide
    names(sumstats)[c(1:2, 4:7)] <- c("chr", "pos", "a0", "a1", "beta", "beta_se")

    sumstats <- na.omit(sumstats)

    results <- data.frame()
    for(chrs in 1:22){

        message("Working on chromosome ", chrs)
        snp_readBed(paste0("../references/LDpred2-ref/chr", chrs, ".bed"))
        # Attach the "bigSNP" object in R session
        # This object contains SNP data for a single chromosome from 1000 Genomes Phase 1:
        # a matrix of genotypes (analogous to the bed file), plus fam and map slots, as in the PLINK format.
        obj.bigSNP <- snp_attach(paste0("../references/LDpred2-ref/chr", chrs, ".rds"))

        # See how the file looks like
        #str(obj.bigSNP, max.level = 2, strict.width = "cut")

        # In this case, we'll focus only on individuals of European ancestry
        euridx <- grep(ancestry, obj.bigSNP$fam$family.ID)
        eurset <- subset(obj.bigSNP, ind.row = euridx)
        obj.bigSNP <- snp_attach(eurset)

        # Get aliases for useful slots
        G   <- obj.bigSNP$genotypes
        CHR <- obj.bigSNP$map$chromosome
        POS <- obj.bigSNP$map$physical.pos
        #y   <- obj.bigSNP$fam$affection - 1
        #NCORES <- nb_cores()

        sumstats.chr <- sumstats[sumstats$chr == chrs,]

        # Also, we must remove all positions with se = 0 (otherwise it will complain)
        sumstats.chr <- sumstats.chr[sumstats.chr$beta_se > 0,]
        # Matching variants between genotype data and summary statistics.
        # snp_match() needs the variables "chr" (chromosome number), "pos" (physical position),
        # "a0" (reference allele) and "a1" (alternative allele) in both the summary statistics
        # and the genotype data; these 4 variables are used to match variants between the two data frames.
        # This step is analogous to the alignment we do when reducing
        # We need n_eff from the summary statistics, to be used later by snp_ldsc2, which performs an LD score regression.
        #sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
        #sumstats$n_case <- sumstats$n_control <- NULL
        map <- obj.bigSNP$map[-(2:3)]
        names(map) <- c("chr", "pos", "a0", "a1")

        message("Matching and aligning SNPs to the reference")
        info_snp.chr <- snp_match(sumstats.chr, map)

        message("Original sumstat had ", nrow(sumstats.chr)," for chr",chrs,". After matching ", nrow(info_snp.chr)," remained, and will be used for further steps.")
        # Computing LDpred2 scores for one chromosome

        #Correlation

        # First, you need to compute correlations between variants. We recommend using a window size of 3 cM (see ref).

        # snp_asGeneticPos() interpolates physical positions (bp) to genetic positions (cM),
        # downloading the genetic maps to `dir` if they are not already there.
        POS2 <- snp_asGeneticPos(CHR, POS, dir = "../references/LDpred2-ref", ncores = NCORES)

        ## indices in info_snp
        #ind.chr <- which(info_snp$chr == 1)
        df_beta <- info_snp.chr[, c("beta", "beta_se", "n_eff")]

        ## indices in G
        ind.chr2 <- info_snp.chr$`_NUM_ID_`
        message("Computing correlation matrix for chr",chrs)

        corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES, infos.pos = POS2[ind.chr2], size = 3 / 1000)
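        # as_SFBM() stores the sparse correlation matrix on disk as a Sparse Filebacked
        # Big Matrix (a .sbk backing file, written under tempdir() by default).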
        corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))

        # We need to estimate h2 from the data, using snp_ldsc2, which performs an LD score regression.
        message("Using LD score regression to estimate h2.")
        ldsc <- snp_ldsc2(corr0, df_beta)
        h2_est <- ldsc[[2]]
        message("Estimated h2: ", h2_est)
        if (h2_est < 1e-4) {
            message("h2_est is too low, so I can't proceed, skipping chr ", chrs, ".")
            next
        }
        # We can also use the automatic model to test many different h2 and p parameters
        # and choose the best model. This takes a few minutes.
        message("Computing LDpred2 auto, with initial p = ",paste(initial_p, collapse=", "), ", initial h2 = ", h2_est,", and ",iterations," iterations (+1000 burn-in)")
        multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est, vec_p_init = initial_p, num_iter = iterations, ncores = NCORES)
        beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
        # We average the resulting betas
        info_snp.chr$beta_auto <- rowMeans(beta_auto)
        results <- rbind(results, info_snp.chr)
        rm(info_snp.chr, corr, corr0, POS2, obj.bigSNP)
        unlink("../references/LDpred2-ref/*bk")
    }
    return(results)
}

However, I keep getting an error along the lines of the one below, which resembles the one in #96:

Computing correlation matrix for chr8

 *** caught bus error ***
address 0x2b026390b000, cause 'non-existent physical address'

Traceback:
 1: write_indval(sbkfile, spmat@i, spmat@x, 0L)
 2: .Object$initialize(...)
 3: initialize(value, ...)
 4: initialize(value, ...)
 5: new("SFBM", spmat = spmat, backingfile = backingfile, symmetric = sym)
 6: bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
 7: computePGS.ldpred2.auto("../datasets/AST_Demenais_29273806_1-sAUCfilt.tsv.gz")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

I investigated the issue (maybe the answer is the same as for #96) and I believe it has something to do with the fact that bigsparser::as_SFBM writes the matrix to disk (in order not to overload RAM, I presume). The problem seems to arise when the files, stored at /tmp (correct me if I'm wrong!), reach 16GB, which is the limit in my case; then everything collapses.
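For reference, one can check where R writes its session temporary files; tempdir() is fixed once per session and honors the TMPDIR environment variable if it is set before R starts:

tempdir()
Sys.getenv("TMPDIR")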

I'd be very interested to know whether you agree with my diagnosis, whether there's an alternative way to do it (maybe a way to provide snp_ldpred2_auto with a matrix from RAM, rather than one written to disk? I know this might be harder to implement; any other idea?), and whether my code for running LDpred2 is more or less correct, so I can run LDpred2-auto as accurately as possible.

Thanks again for your useful comments on our previous attempt. Cheers!

privefl commented 4 years ago
  1. The 1000G data in the tutorial is just some fake data for the example; if you can use a larger reference panel (e.g. 10K individuals from the UKBB, or your own data), it would probably be better.

  2. You don't need to use subset(); you can instead use the parameter ind.row in snp_cor().

  3. Since you have the latest versions of the packages, you can do corr <- as_SFBM(corr0) instead of corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix")); that will save you a bit of time and memory.

  4. I've learned over the years that it is bad practice to write large files in /tmp; you should rather make some tmp-data directory and write there (as in https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L66-L68, and the sketch below). I could probably do this in the tutorial as well. Does this solve your problem?
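Something along these lines, adapting your loop (an untested sketch reusing your variables ind.chr2, POS2 and NCORES; tmp-data is just an example directory):

# keep all individuals in the bigSNP object; restrict to Europeans via ind.row
obj.bigSNP <- snp_attach(paste0("../references/LDpred2-ref/chr", chrs, ".rds"))
euridx <- grep("EUR", obj.bigSNP$fam$family.ID)
G <- obj.bigSNP$genotypes

corr0 <- snp_cor(G, ind.row = euridx, ind.col = ind.chr2,
                 infos.pos = POS2[ind.chr2], size = 3 / 1000, ncores = NCORES)

# write the SFBM backing file to a dedicated directory instead of /tmp
dir.create("tmp-data", showWarnings = FALSE)
tmp <- tempfile(tmpdir = "tmp-data")
corr <- as_SFBM(corr0, backingfile = tmp)

# ... run snp_ldpred2_auto() as before, then clean up:
file.remove(paste0(tmp, ".sbk"))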

GRealesM commented 4 years ago

Thanks for your quick reply! Regarding (1), would a subset of European individuals from the 1000 Genomes Phase III reference panel (which I already have in bed format, and which would be easy for me to use) be a valid panel? I don't have access to UKBB individual-level data at the moment. As for (2), (3), and (4), I will try that. Hopefully this will solve my problem; I'll let you know if it does. Thanks!

GRealesM commented 4 years ago

Just to clarify: I'm trying to obtain PGS models (i.e. weighted effects, the output of snp_ldpred2_auto) using QC'ed summary statistics. I don't have an external individual-level dataset to tune the parameters, which is why I'm using LDpred2-auto. From what I understand, in order to compute corr we need individual genotypes, which can come from a panel. In the tutorial you use 1000 Genomes Phase I, and you simulate a fake phenotype to test your model and choose the best from a number of models created using different h2_est and init_p values. I understand that that's the reason why you say the data is fake, but I actually only need the panel to compute the LD matrix corr, and I believe it would be reasonable to use 1000G Phase III to that end. Is my rationale right? Merci!

bvilhjal commented 4 years ago

Hi,

Thanks for your interest in LDpred2. I recommend that you use a larger sample size for the LD reference than the 1000G data; you'd ideally need between 2,000 and 10,000 individuals. Note that you can use the test/validation data as your LD reference without a (meaningful) risk of bias. I'm not sure what we can recommend if you don't have access to such data. We might consider providing LD for a "good" subset of variants based on the UKB in the future (this would not be individual-level data), but we currently do not support that. Also, this would be an annoyingly large file to download.

Best, Bjarni

privefl commented 4 years ago

I don't remember exactly, but the 22 LD matrices based on HM3 variants are like 3GB total.

bvilhjal commented 4 years ago

That's not too bad! Maybe we can make them available to users. It would also avoid issues related to poorly QC'd LD references, etc.

privefl commented 4 years ago

I don't know... with all the issues reported for some variants in the UKBB lately, I'm not even sure that it would be a very good reference for now.

GRealesM commented 4 years ago

That sounds like a good idea, Florian. Would it be possible for you to make LD matrices based on HapMap3 variants available for users? I think that would largely solve my problem, and would also be helpful for other users who don't have a big panel at hand to compute LD matrices from. Thanks a lot for your comments and help!

privefl commented 4 years ago

I'll think about it. As I said, UKBB data is not perfect at all.

Does writing to another directory than /tmp solve your problem?

GRealesM commented 4 years ago

I will try that, but as Bjarni said, my 1000 Genomes panel is not large enough for generating the LD matrices, so I'm not sure whether it's even worth trying.

privefl commented 4 years ago

It is large enough to perform the analysis; it's just that you would probably get slightly better results with a larger reference panel.

GRealesM commented 4 years ago

Ok, I will try to run with it and I'll let you know.

privefl commented 4 years ago

But what are you doing with these PGS? If you test the prediction somewhere else, you could maybe use that data for computing the LD.

GRealesM commented 4 years ago

I'm comparing LDpred2-auto performance with our method, RápidoPGS, using an evaluation method that uses summary-level data only. I will test their predictions using individual data in the future, but right now I'm interested in how the models compare at the summary-statistics level. I will keep in mind, though, that LDpred2-auto results will likely improve with bigger individual-level datasets. I'll try to get the 1KG panel up and running and I'll let you know if I manage to fix the issue. Thanks a lot to both of you for your comments and suggestions.

privefl commented 4 years ago

I'll likely make an LD reference available in the future. But, as we have some issues with the cluster at the moment, it might not be for another 1-2 months.

GRealesM commented 4 years ago

Ok, it worked when I explicitly added a line to remove the temp file after each iteration, thus releasing space in /tmp:

temp_file <- tempfile()
[...]
unlink(paste0(temp_file, "*"))
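In context, the pattern is roughly this (an illustrative expansion, assuming as_SFBM()'s backingfile argument; the loop body is abbreviated, not my exact code):

for (chrs in 1:22) {
  temp_file <- tempfile()
  # ... compute corr0 for this chromosome as before ...
  corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"), backingfile = temp_file)
  # ... run snp_ldpred2_auto() and collect the betas ...
  unlink(paste0(temp_file, "*"))  # delete the .sbk backing file before the next chromosome
}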

Thanks!

privefl commented 4 years ago

Please note that I've just released an LD reference here: https://doi.org/10.6084/m9.figshare.13034123. There is also an example script there on how to use it. A new version of the paper should appear on bioRxiv tomorrow.

privefl commented 4 years ago

Thank you for using LDpred2. Please note that we now recommend running LDpred2 genome-wide instead of per chromosome. The paper (preprint) and tutorial have been updated.