satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
206 stars 33 forks source link

add glmGamPoi to increase SCT speed #66

Closed yuhanH closed 4 years ago

yuhanH commented 4 years ago

I add glmGamPoi to run negative binomial regression, and it can increase sctransform speed significantly. For the pbmc3k, it decreases the time from 53s to 18s, and residuals don't change. Because the glmGamPoi is not in CRAN, I add glmGamPoi to Suggests in DESCRIPTION

library(SeuratData)
data("pbmc3k")

umi <- GetAssayData(object = pbmc3k[["RNA"]], slot = 'counts')

View(vst)
t1<- Sys.time()
residual.poisson<- vst(umi = umi, method = "poisson" )
t2<- Sys.time()
residual.glm <- vst(umi = umi, method = "glmGamPoi" )
t3<- Sys.time()

# poisson time
t2-t1

# glmGamPoi time
t3-t2

# check the residual correlation
rownames(residual.poisson$y) == rownames(residual.glm$y)
residual.cor <- sapply(1:nrow(residual.poisson$y), function(x)  cor(residual.poisson$y[x,], residual.glm$y[x,])) 
hist(residual.cor, breaks = 200)
ChristophH commented 4 years ago

Looking good! Thank you! Two issues: 1) Could you make future pull request off the develop branch? 2) When does glmGamPoi return a Inf theta? What do the other approaches do in those cases? I am hesitant to remove those genes from the output without knowing what genes are affected. Could we override the Inf value with something sensible instead?

const-ae commented 4 years ago

Hi Christoph,

I am the author of glmGamPoi and would be delighted to see this integrated into sctransform. If @yuhanH is busy, I could also create a new PR to address your comments :)

When does glmGamPoi return a Inf theta? What do the other approaches do in those cases?

glmGamPoi infers the overdispersion for each gene which is the inverse of your theta. If there is no evidence that a gene is overdispersed, I assume a Poisson GLM instead of a Gamma-Poisson GLM and set the overdispersion estimate to 0 (ie. theta = Inf).

I am hesitant to remove those genes from the output without knowing what genes are affected. Could we override the Inf value with something sensible instead?

I fully agree that you shouldn't remove the genes with theta = Inf, instead one could introduce a minimum overdispersion of 10^-8 (which is the limit from DESeq2).

Best, Constantin

satijalab commented 4 years ago

Thanks all. I agree that excluding genes is not ideal, but we do need to test the effect of replacing Inf with 10e-8 on the regularization. As long as the results look good we can make this change.

ChristophH commented 4 years ago

Thanks @const-ae for chiming in and developing such a great package!

I'd like to enforce a minimum overdispersion - so whenever glmGamPoi assumes a Poisson and returns theta = Inf (overdispersions = 0) a high finite value for theta should be used instead. Since the interpretation of theta is tied to the (expected) mean, I propose theta = m / 1e-5, where m could be the average of the expected values (mu), or simply the geometric mean of the observed counts. Here the 1e-5 is based on observations when using sctransform with glm.nb on the pbmc_10k_v3 data - all but one gene out of 2k have gmean / theta > 1e-5, so this seems to be a good limit and should not mess up regularization.

const-ae commented 4 years ago

Thanks for the kind words :)

I'd like to enforce a minimum overdispersion - so whenever glmGamPoi assumes a Poisson and returns theta = Inf (overdispersions = 0) a high finite value for theta should be used instead

Yes, that sounds like a good idea to me.

Since the interpretation of theta is tied to the (expected) mean, I propose theta = m / 1e-5, where m could be the average of the expected values (mu), or simply the geometric mean of the observed counts. Here the 1e-5 is based on observations when using sctransform with glm.nb on the pbmc_10k_v3 data - all but one gene out of 2k have gmean / theta > 1e-5, so this seems to be a good limit and should not mess up regularization.

I am not too familiar with your exact regularization procedure, but that heuristic seems reasonable to me.

Do you want me to give it a try implement it or do you want to edit the PR yourself?

ChristophH commented 4 years ago

I'll go ahead and make the necessary edits

satijalab commented 4 years ago

I tried this using pbmc3k, (code added below), and think its reasonable. residual variances where highly correlated with/without this modification. this did create some extremely high values of theta (~15,000). perhaps that is the point (no overdispersion), but i do worry a bit about having that level of outlier overwhelm the regularization. i would suggest a slightly different denominator (1e-2 or 1e-3) but as the influence on the final results seem to be minimal, i'd trust christoph's judgement

satijalab commented 4 years ago
  model_pars <- get_model_pars(genes_step1, bin_size, umi, model_str, cells_step1, method, data_step1, theta_given, show_progress)
  inf_genes <- which(model_pars[,1]==Inf)
  inf_means <- 10^(genes_log_gmean_step1[names(inf_genes)])
  model_pars[inf_genes,1] <- inf_means / 1e-5
ChristophH commented 4 years ago

Yeah... I'm playing around with some data sets and am starting to think that it's probably better to move away from directly regularizing theta (or log10(theta)), but instead transform to some relative measure of overdispersion before regularization

the problem is not glmGamPoi specific, though, so I'll address that separately

I'll stick to 1e-5 for now

ChristophH commented 4 years ago

FYI, here are the parameters for pbmc_3k for the following three methods A) poisson + theta.ml B) glm.nb C) glmGamPoi Rplot

On a single core, runtimes with default parameters are 54s, 303s, 27s

glmGamPois could be much faster if I called it on blocks of genes and not on individual genes - that's something I want to change

ChristophH commented 4 years ago

@yuhanH Thank you for this PR! Instead of fixing the conflicts and then adding the changes discussed here, I decided to add things myself in a9e1314