Closed dariomel closed 3 years ago
You are right that the function could be made faster, especially when a large number of cores is available, but there is a reason for the current design.
Every step of the outer loop takes bin_size
(default 256) genes and turns the sparse count matrix into a dense matrix (get_model_pars
) or calculates the expected counts (get_model_pars_nonreg
). The inner loop then loops over each gene in the bin and does the actual work.
The outer loop is not parallelized to save RAM. If the outer loop was parallelized there would be an entire gene by cell matrix in dense format in memory. On a lot of hardware this would not work for large input matrices. The current solution has the drawback of a large overhead due to starting a new process/session for each gene, but allowed for a relatively straight forward use of parallelization while still being able to control the RAM being used.
What I should probably do is to further subdivide the genes in the bin into N bins where N is the number of workers and then run future_lapply
on that so that only N processes/sessions are created.
good points. Ideally one would use multiple threads that read different blocks from the same sparse matrix in parallel, without allocating an intermediate dense matrix for each block. Unfortunately R does not support multi-threading.
See 595c1d3
Refactor model fitting code
Move functions to estimate model parameters into own file; determine number of workers and further split a chunk of genes into sub-blocks for model fitting - this increases multicore performance for functions that can estimate parameters for many genes at once (glmGamPoi only at the moment)
function
future_lapply
is used quite inefficiently within functionsget_model_pars
andget_model_pars_nonreg
https://github.com/ChristophH/sctransform/blob/81614def2851d690875089528fc17bd6be19d030/R/vst.R#L326 https://github.com/ChristophH/sctransform/blob/81614def2851d690875089528fc17bd6be19d030/R/vst.R#L393
In both cases,
future_lapply
is currently used to parallelize the inner loop, thus incurring a large overhead due to process switching, particularly in functionget_model_pars_nonreg
. A better solution is to usefuture_lapply
to parallelize the outer loop, and use a simplelapply
to implement the inner loop, like soTo illustrate, on a Ubuntu server with 32 virtual CPUs and 90 GB of RAM, using
future::plan(strategy="multisession",workers=8)
with the current implementation of functionsget_model_pars
andget_model_pars_nonreg
, a call tosctransform::vst
took almost 20 min:After changing the code as shown in the attached diff file, vst.R.txt, the same call to
sctransform::vst
ran more than 4 times faster:Similar timings were obtained by switching back and forth between the two implementations. Moreover, using
future::plan(strategy="multicore",workers=8)
with the current implementation of functionsget_model_pars
andget_model_pars_nonreg
, caused the latter function to hang, likely because of a bug in packagefuture
. Interestingly, after making the changes in the attached diff file, functionget_model_pars_nonreg
did not hang, although in this casesctransform::vst
seemed to run a bit slower than when usingstrategy="multisession"
,