satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0

Regressing out vars.to.regress takes a very long time for samples with >10k cells #195

Open professor-sagittarius opened 5 months ago

professor-sagittarius commented 5 months ago

Hi, I'm trying to run SCTransform on a 32-sample, 180k-cell Seurat v5 object with 4 vars.to.regress covering various aspects of cell complexity and ambient contamination. Most of the SCTransform steps are relatively fast, but for samples with more than 10k cells, "Regressing out...." can take up to an hour. I'm surprised that the regression time doesn't scale linearly with cell number: even samples with up to 9500 cells finish regression in a matter of seconds, and only samples with more than 10k cells slow way down.

I thought about speeding this up by running SCTransform on 32 single-sample Seurat objects as parallel jobs, but I was spooked by the fact that this approach only showed one regression step for each sample, while the multi-sample object seemed to have 3 regression steps per sample (pre- and post-residuals, and then once more after all samples had been SCTransformed). Is this approach okay, and do I need to do anything else once I merge all the objects again? Or should I stick to the multi-sample object? The purposes of the different regression steps are not clear to me, so it's difficult for me to answer this question myself. If it's best to keep the samples in the same object, how can I speed up the regression step? This seems important with the introduction of GEM-X and other methods that allow samples with up to 20k cells. :)
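
For concreteness, here is a minimal sketch of the per-sample approach described above. It is not a confirmed or recommended workflow; whether its result matches the multi-sample run is exactly the open question. The function name `sct_per_sample`, the metadata column name `"sample"`, the use of the future / future.apply packages for the parallel jobs, and resetting variable features with `SelectIntegrationFeatures()` after merging are all assumptions made for illustration.

```r
library(Seurat)
library(future)
library(future.apply)

# Hypothetical helper: split a Seurat object by sample, run SCTransform on
# each piece in parallel, then merge the pieces back into one object.
sct_per_sample <- function(obj, split_by = "sample", vars = NULL, workers = 4) {
  # One Seurat object per value of the metadata column `split_by`
  obj_list <- SplitObject(obj, split.by = split_by)

  # Parallel backend (assumption). Large objects may require raising
  # options(future.globals.maxSize = ...) before this call.
  old_plan <- plan(multisession, workers = workers)
  on.exit(plan(old_plan), add = TRUE)

  # Each sample gets its own SCT model and its own regression step
  obj_list <- future_lapply(
    obj_list,
    function(x) SCTransform(x, vars.to.regress = vars, verbose = FALSE),
    future.seed = TRUE
  )

  # Merge the per-sample objects back together
  merged <- merge(x = obj_list[[1]], y = obj_list[-1])

  # Assumption: variable features are chosen again across samples after
  # merging, since each sample selected its own set.
  VariableFeatures(merged) <- SelectIntegrationFeatures(
    object.list = obj_list, nfeatures = 3000
  )
  merged
}
```

Usage would look like `merged <- sct_per_sample(obj, vars = my_vars)`, where `obj` is the 32-sample object and `my_vars` stands in for the 4 covariate column names, which are not given here.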

UPDATE: The same function call is much faster today, so I guess the speed may be a transient issue unrelated to SCTransform itself. It would still be nice to know the effect of running SCTransform in parallel on separate objects that are later merged.