sctransform::vst could run faster

dariomel commented 4 years ago

function future_lapply is used quite inefficiently within functions get_model_pars and get_model_pars_nonreg

https://github.com/ChristophH/sctransform/blob/81614def2851d690875089528fc17bd6be19d030/R/vst.R#L326 https://github.com/ChristophH/sctransform/blob/81614def2851d690875089528fc17bd6be19d030/R/vst.R#L393

In both cases, future_lapply is currently used to parallelize the inner loop, thus incurring a large overhead due to process switching, particularly in function get_model_pars_nonreg. A better solution is to use future_lapply to parallelize the outer loop, and use a simple lapply to implement the inner loop, like so

get_model_pars_nonreg <- function(genes, bin_size, model_pars_fit, regressor_data, umi, model_str_nonreg, cell_attr, show_progress) {
  bin_ind <- ceiling(x = 1:length(x = genes) / bin_size)
  max_bin <- max(bin_ind)
  if (show_progress) {
    pb <- txtProgressBar(min = 0, max = max_bin, style = 3)
  }
  model_pars_nonreg <- withCallingHandlers(do.call(rbind,future_lapply(1:max_bin,function(i) {
    genes_bin <- genes[bin_ind == i]
    mu <- tcrossprod(model_pars_fit[genes_bin, -1, drop=FALSE], regressor_data)
    umi_bin <- as.matrix(umi[genes_bin, ])
    signalCondition(simpleCondition(''))
    do.call(rbind,
                                      lapply(genes_bin, function(gene) {
                                        fam <- negative.binomial(theta = model_pars_fit[gene, 'theta'], link = 'log')
                                        y <- umi_bin[gene, ]
                                        offs <- mu[gene, ]
                                        fit <- glm(as.formula(model_str_nonreg), data = cell_attr, family = fam, offset=offs)
                                        return(fit$coefficients)
                                      }))
    })), # do.call
    condition=function(...) if (show_progress) {
      setTxtProgressBar(pb, getTxtProgressBar(pb)+1)
    }
  ) # withCallingHandlers
  if (show_progress) {
    close(pb)
  }
  rownames(model_pars_nonreg) <- genes
  return(model_pars_nonreg)
}

To illustrate, on a Ubuntu server with 32 virtual CPUs and 90 GB of RAM, using future::plan(strategy="multisession",workers=8) with the current implementation of functions get_model_pars and get_model_pars_nonreg, a call to sctransform::vst took almost 20 min:

Calculating cell attributes for input UMI matrix
Variance stabilizing transformation of count matrix of size 22207 by 8468
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 8468 cells
  |======================================================================| 100%
Found 7 outliers - those will be ignored in fitting/regularization step

Estimating parameters for following non-regularized variables: mtp
  |======================================================================| 100%
Second step: Get residuals using fitted parameters for 22207 genes
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 19.43755 mins

After changing the code as shown in the attached diff file, vst.R.txt, the same call to sctransform::vst ran more than 4 times faster:

Calculating cell attributes for input UMI matrix
Variance stabilizing transformation of count matrix of size 22207 by 8468
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 8468 cells
  |======================================================================| 100%
Found 7 outliers - those will be ignored in fitting/regularization step

Estimating parameters for following non-regularized variables: mtp
  |======================================================================| 100%
Second step: Get residuals using fitted parameters for 22207 genes
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 4.525548 mins

Similar timings were obtained by switching back and forth between the two implementations. Moreover, using future::plan(strategy="multicore",workers=8) with the current implementation of functions get_model_pars and get_model_pars_nonreg, caused the latter function to hang, likely because of a bug in package future. Interestingly, after making the changes in the attached diff file, function get_model_pars_nonreg did not hang, although in this case sctransform::vst seemed to run a bit slower than when using strategy="multisession",

Calculating cell attributes for input UMI matrix
Variance stabilizing transformation of count matrix of size 22207 by 8468
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 8468 cells
  |=========                                                             |  12%
Found 7 outliers - those will be ignored in fitting/regularization step

Estimating parameters for following non-regularized variables: mtp
  |======================================================================| 100%
Second step: Get residuals using fitted parameters for 22207 genes
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 5.346129 mins

ChristophH commented 4 years ago

You are right that the function could be made faster, especially when a large number of cores is available, but there is a reason for the current design.

Every step of the outer loop takes bin_size (default 256) genes and turns the sparse count matrix into a dense matrix (get_model_pars) or calculates the expected counts (get_model_pars_nonreg). The inner loop then loops over each gene in the bin and does the actual work. The outer loop is not parallelized to save RAM. If the outer loop was parallelized there would be an entire gene by cell matrix in dense format in memory. On a lot of hardware this would not work for large input matrices. The current solution has the drawback of a large overhead due to starting a new process/session for each gene, but allowed for a relatively straight forward use of parallelization while still being able to control the RAM being used. What I should probably do is to further subdivide the genes in the bin into N bins where N is the number of workers and then run future_lapply on that so that only N processes/sessions are created.

dariomel commented 4 years ago

good points. Ideally one would use multiple threads that read different blocks from the same sparse matrix in parallel, without allocating an intermediate dense matrix for each block. Unfortunately R does not support multi-threading.

ChristophH commented 3 years ago

See 595c1d3

Refactor model fitting code

Move functions to estimate model parameters into own file; determine number of workers and further split a chunk of genes into sub-blocks for model fitting - this increases multicore performance for functions that can estimate parameters for many genes at once (glmGamPoi only at the moment)

satijalab / sctransform

sctransform::vst could run faster #59