stephenslab / mashr

An R package for multivariate adaptive shrinkage.
https://stephenslab.github.io/mashr

mash() takes weeks to compute posterior matrices #87

Closed resztak closed 3 years ago

resztak commented 4 years ago

I'm running mash on eQTL mapping results (~4M tests × 5 conditions). mash(), specifically the "Computing posterior matrices" step, has been running for over two weeks with 80 GB of memory. Is it expected to take this long? Is there a way to speed it up? Here are my code and preprocessed input data for five conditions: http://genome.grid.wayne.edu/SCAIP_mashr_github/Tcell_mash-ready-data.Rd ./mashr_Tcell.R ./mashR_Tcell_prep.R

gaow commented 4 years ago

@resztak It should not take long to fit the mash model using "top" and "random" subsets of effects, as in Urbut et al. 2019, where we used ~20K top and ~40K random effects. Applying the learned model to obtain posteriors for 4M effects can take a long time if you do it on a desktop computer. What we have done in practice is to partition the 4M effects into batches of, say, 40K effects and use an HPC cluster to do the computation. To illustrate, this is what we did for the recent GTEx paper (V8):

https://github.com/stephenslab/gtexresults/blob/master/workflows/mashr_flashr_workflow.ipynb

See the section "Compute MASH posteriors". You don't have to use our pipeline as is if you are not comfortable or familiar with it, but hopefully it gives you an idea of how it is done.

BTW, I am still surprised it takes weeks for just 5 conditions. Have you tried a small toy dataset, e.g. with 10K effects, to see how long that takes? That would also give you an idea of how long it might take to complete everything.
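
In outline, that approach could look something like the sketch below. This is a rough sketch only, reusing the data object data and the data-driven covariances U.ed from the user's scripts; in practice each batch would be submitted as a separate cluster job, as in the workflow linked above.

library(mashr)

# Fit the mash model once, here on a random subset of ~40K effects (in
# practice a "strong" subset is also used to estimate data-driven
# covariances such as U.ed).
random.rows <- sample(nrow(data$Bhat), 40000)
data.random <- mash_set_data(data$Bhat[random.rows, ], data$Shat[random.rows, ])
m           <- mash(data.random, U.ed, outputlevel = 1)

# Apply the fitted model to all ~4M effects in batches of 40K; each batch
# could be run as a separate job on the cluster.
batch.size <- 40000
batches    <- split(seq_len(nrow(data$Bhat)),
                    ceiling(seq_len(nrow(data$Bhat)) / batch.size))
posteriors <- lapply(batches, function(rows) {
  data.batch <- mash_set_data(data$Bhat[rows, ], data$Shat[rows, ])
  mash_compute_posterior_matrices(g = m, data = data.batch)
})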

resztak commented 4 years ago

Thanks! I tried mash_compute_posterior_matrices(g = m.ed, data = data.subset) on a subset of 10k tests, with the pre-fit model m.ed = mash(data, U.ed, outputlevel = 1); it takes 12 seconds, while 100k tests take ~17 min. I should be able to get enough cores to chop the data into 100k chunks and get it done quickly.

stephens999 commented 4 years ago

Does anyone have any idea why it takes more than 10x longer for 100k vs 10k? The computation should scale linearly with the number of tests, so if 10k takes 12 s, 100k should take about 120 s, not 17 min. Am I missing something?

gaow commented 4 years ago

@stephens999 Good question; we will have to reproduce it and see what's going on. @resztak Just to double-check which model you are using (so we can focus on the right section of code): when you set up the mash data with mash_set_data, did you use alpha=0 or alpha=1 (the default is 0)?
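
For reference, a minimal sketch of where that option is set; Bhat and Shat here stand in for the effect-estimate and standard-error matrices:

library(mashr)
data.ee <- mash_set_data(Bhat, Shat, alpha = 0)  # exchangeable effects (EE), the default
data.ez <- mash_set_data(Bhat, Shat, alpha = 1)  # exchangeable z-scores (EZ)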

pcarbo commented 4 years ago

mash can be memory-intensive. My first guess is that it is using a lot of memory and perhaps swapping to virtual memory, which can be very, very slow. @resztak Can you tell us more about your computing setup? How much memory (RAM) do you have?

resztak commented 4 years ago

I am running this on the cluster, so the 80GB is solely dedicated to the mash process. I haven't changed the default alpha when setting the data. Should I have?

stephens999 commented 4 years ago

That sounds like plenty of memory for 10k or 100k effects...

@resztak Do you consistently see that running 100k takes much longer than ten batches of 10k run sequentially? That seems weird...

resztak commented 4 years ago

Yes, the 12 s/10k and ~17 min/100k timings are repeatable, even with less memory.

pcarbo commented 4 years ago

@resztak If you are able to share with us your data & code, I will try to reproduce your timings on our computing systems.

resztak commented 4 years ago

Sure! http://genome.grid.wayne.edu/SCAIP_mashr_github/10-11-2020/

pcarbo commented 4 years ago

@resztak Thanks for sharing this.

@gaow I was able to replicate the issue. More concerning, I found that the runtime increases quadratically with the number of observations (rows of data$Bhat). The computations should be linear in the sample size, right? The bulk of the computation appears to be happening inside compute_posterior_matrices.

Let me know if there are any other tests I can run to collect more information about the issue.

n     elapsed(s)
10000     12.676
20000     53.962
40000    217.690

Also, I found that memory usage is about 5 GB in this example.

gaow commented 4 years ago

The computations should be linear in the sample size, right?

Yes. The loop implemented here is a straightforward loop over the sample size ...

@pcarbo One test we can try is to use algorithm='R' and see if we can reproduce it; if so, we can line-profile memory in R. Otherwise, the only suspicious C++ code might be the OpenMP-based parallelization, which we could check next (simply by commenting OpenMP out of the Makevars file)?
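
A rough sketch of that first test (assuming the relevant argument is algorithm.version, and reusing m.ed and a 10k-row subset data2 as in the timing script below):

library(mashr)

# Run the pure-R implementation while profiling time and memory with Rprof.
Rprof("mash-R-profile.out", memory.profiling = TRUE)
out <- mash_compute_posterior_matrices(g = m.ed, data = data2,
                                       algorithm.version = "R")
Rprof(NULL)
summaryRprof("mash-R-profile.out", memory = "both")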

pcarbo commented 4 years ago

Okay, I will take a look.

FYI, here is the simplified script I ended up using to generate my timings above:

library(ashr)
library(mashr)
set.seed(1)

# Load the shared data (provides the mash data object "data" and the
# covariance matrices "U.ed").
load("Tcell_mash-ready-data.Rd")

# Fit the mash model without computing posteriors.
timing <- system.time(m.ed <- mash(data, U.ed, outputlevel = 1))
print(timing)

# Grab a random subset of rows to test how long the posterior computation takes.
rows   <- sample(nrow(data$Bhat), 10000)
data2  <- mash_set_data(data$Bhat[rows, ], data$Shat[rows, ])
timing <- system.time(
  out <- mash_compute_posterior_matrices(g = m.ed, data = data2))
print(timing)

pcarbo commented 4 years ago

@gaow I created a new no-openmp branch to perform additional tests. Even after removing all the OpenMP code, I still see the same sharp rise in runtime of mash_compute_posterior_matrices as the number of samples goes up:

10,000: 13 s
20,000: 55 s
30,000: 130 s

So clearly this "bug" is not due to OpenMP.

Alice-MacQueen commented 3 years ago

I am also having this issue, and I independently (redundantly, it turns out) traced the slowdown to the code in compute_posterior_matrices.

There is a lot of matrix math in this function, and as you say, ideally its runtime would scale linearly with the number of rows. mash is also memory-intensive. Would you consider using the big_apply or big_parallelize functions from the bigstatsr package (https://privefl.github.io/bigstatsr/index.html) to speed up this function for matrices with many rows? That package can handle matrices that are too large to fit in memory by memory-mapping them to a binary file on disk (which could reduce mash's memory footprint). It also lets you apply standard R functions to large matrices in parallel, which significantly decreases their run time. For some examples, see: https://privefl.github.io/R-presentation/bigstatsr.html#1
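
As a rough illustration only (this is not part of the mashr API; Bhat and Shat are assumed to be plain effect/SE matrices and m an existing mash fit), a row-blocked version could look something like:

library(bigstatsr)
library(mashr)

# Store the effect estimates and standard errors as file-backed matrices.
Bhat.fbm <- as_FBM(Bhat)
Shat.fbm <- as_FBM(Shat)

# Compute posterior means block-by-block over rows; the other posterior
# matrices could be combined the same way.
pm <- big_apply(Bhat.fbm,
                a.FUN = function(X, ind, Shat.fbm, m) {
                  d <- mashr::mash_set_data(X[ind, , drop = FALSE],
                                            Shat.fbm[ind, , drop = FALSE])
                  mashr::mash_compute_posterior_matrices(g = m, data = d)$PosteriorMean
                },
                a.combine = "rbind",
                ind = rows_along(Bhat.fbm),
                block.size = 10000,
                Shat.fbm = Shat.fbm, m = m)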

stephens999 commented 3 years ago

We could consider that, but first we should probably track down the source of this issue, because the computation should not increase quadratically with the number of tests...

Alice-MacQueen commented 3 years ago

That makes sense. Any reason not to use something like the following code chunk in the meantime? It subsets a larger object with effects and standard errors, calculates posterior weights for each subset, then combines the objects into a mash object at the end.

A bit of a kludge, but using all.equal on the sub-components doesn't find any differences for marker data computed using subsets of different sizes.

Unless I am missing something?

## Batch process SNPs through this; don't run on the full set if > 20000 rows,
## even for 1M SNPs, because computing posterior weights scales quadratically
## with the number of rows in Bhat and Shat (10K = 13s, 20K = 55s, 40K = 218s).
## By my calculation, this starts getting unwieldy between 4000 and 8000 rows.
## See mash issue: https://github.com/stephenslab/mashr/issues/87
if (gwas2$nrow > 20000) {
  subset_size <- 4000
  n_subsets   <- ceiling(gwas2$nrow / subset_size)
  printf2(verbose = verbose,
          "\nSplitting data into %s sets of 4K markers to speed computation.\n",
          n_subsets)
  for (i in 1:n_subsets) {
    if (i < n_subsets) {
      from <- i*subset_size - (subset_size - 1)
      to   <- i*subset_size
      row_subset <- from:to
    } else {
      from <- n_subsets*subset_size - (subset_size - 1)
      to   <- gwas2$nrow
      row_subset <- from:to
    }
    Bhat_subset <- as.matrix(gwas2[row_subset, ind_estim])   # columns with effect estimates
    Shat_subset <- as.matrix(gwas2[row_subset, ind_se])      # columns with SEs
    data_subset <- mashr::mash_set_data(Bhat_subset, Shat_subset, V = Vhat)
    m_subset    <- mashr::mash(data_subset, g = ashr::get_fitted_g(m), fixg = TRUE)

    if (i == 1) {
      m2 <- m_subset
    } else {     # make a new mash object with the combined results.
      PosteriorMean <- rbind(m2$result$PosteriorMean, m_subset$result$PosteriorMean)
      PosteriorSD   <- rbind(m2$result$PosteriorSD, m_subset$result$PosteriorSD)
      lfdr          <- rbind(m2$result$lfdr, m_subset$result$lfdr)
      NegativeProb  <- rbind(m2$result$NegativeProb, m_subset$result$NegativeProb)
      lfsr          <- rbind(m2$result$lfsr, m_subset$result$lfsr)
      posterior_matrices <- list(PosteriorMean = PosteriorMean,
                                 PosteriorSD   = PosteriorSD,
                                 lfdr          = lfdr,
                                 NegativeProb  = NegativeProb,
                                 lfsr          = lfsr)
      loglik      <- m2$loglik    # NB: must recalculate from sum(vloglik) at the end
      vloglik     <- rbind(m2$vloglik, m_subset$vloglik)
      null_loglik <- c(m2$null_loglik, m_subset$null_loglik)
      alt_loglik  <- rbind(m2$alt_loglik, m_subset$alt_loglik)
      fitted_g    <- m2$fitted_g  # all four components are equal across subsets
      posterior_weights <- rbind(m2$posterior_weights, m_subset$posterior_weights)
      alpha       <- m2$alpha     # equal across subsets
      m2 <- list(result            = posterior_matrices,
                 loglik            = loglik,
                 vloglik           = vloglik,
                 null_loglik       = null_loglik,
                 alt_loglik        = alt_loglik,
                 fitted_g          = fitted_g,
                 posterior_weights = posterior_weights,
                 alpha             = alpha)
      class(m2) <- "mash"
    }
  }
  # Total loglik in the mash function is loglik = sum(vloglik).
  m2$loglik <- sum(m2$vloglik)
} else {
  Bhat_full <- as.matrix(gwas2[, ind_estim])
  Shat_full <- as.matrix(gwas2[, ind_se])
  data_full <- mashr::mash_set_data(Bhat_full, Shat_full, V = Vhat)
  m2 <- mashr::mash(data_full, g = ashr::get_fitted_g(m), fixg = TRUE)
}
gaow commented 3 years ago

Thanks to a pointer from @pcarbo about a coding mistake I made, this issue has now been fixed in version 0.2.45 (current master).
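
To pick up the fix, one option (a suggestion, not from the thread) is to install the current master branch from GitHub:

# Requires the remotes package: install.packages("remotes")
remotes::install_github("stephenslab/mashr")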