satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
206 stars 33 forks source link

Memory friendly option for those interested only in corrected counts and HVGs #56

Closed gokceneraslan closed 4 years ago

gokceneraslan commented 4 years ago

Thanks very much for this nice method and the implementation. If the size of the input matrix is really big, residuals use too much memory and in some cases we're only interested in corrected counts (which are sparse and more memory friendly) and the list of highly variable genes. So I was wondering if it'd make sense to add a return_only_residual_var option to the vst function which leaves vst_out$y field NULL but creates another field like vst_out$residual_var without storing the full y residual matrix in memory. This way we can get residual variance values and corrected sparse matrix without storing full dense y matrix. Would that be possible?

ChristophH commented 4 years ago

Hi Gökçen,

Great to hear you are using sctransform and find it useful.

To identify the highly variable genes without generating the full residual matrix, you can set residual_type = 'none' in sctransform::vst(). That will leave the residual matrix in the return object empty but the model parameters are estimated. In the next step you can use the sctransform::get_residual_var() function to get the residual variance (again without saving all the residuals). (This approach is also taken in Seurat::SCTransform with conserve.memory = TRUE)

To get the corrected counts you can then use sctransform::correct_counts(). This function returns the corrected counts as a sparse matrix for all genes. There is currently no option to have it return only a subset of the genes, but if you think that would be helpful, I can certainly add that.

Let me know if that workflow makes sense and works for you. There would be room for improvement, since this way you are calculating the residuals on the fly twice (once to get the residual variance, once to get the corrected counts). It's a minor point since overall the estimation of model parameters would still dominate the runtime.

Cheers, Christoph

gokceneraslan commented 4 years ago

Oh, perfect! Sorry, I didn't know that get_residual_var() can calculate residuals on the fly. Thanks a lot!

gokceneraslan commented 4 years ago

Hi @ChristophH, I have two more questions, if I may:

1) https://github.com/ChristophH/sctransform/blob/master/R/denoise.R#L146 here it seems like the batch effects are not corrected in correct_counts() function, is that correct? It's only the median of all numerical variables. However, the results look batch corrected (edit: which is possibly bcs HVGs are not affected by batch effects). I wonder if it makes sense to set batch category to zero (or to a given reference batch category) during reverse regression step, if batch_var is given. Otherwise shouldn't it re-introduce the batch effects? If you do not prefer that maybe a warning would be good, because users might think that corrected data is also batch corrected.

2) In ComBat's batch correction, there is a nice trick. The model also accepts biological covariates and use these in model fit but only the batch variable is regressed out. I was wondering if you're planning to add something similar. With existing experimental designs it's bit hard to use such variables in single cell, but might be useful for some people.

Cheers.