Closed yuhanH closed 4 years ago
Looking good! Thank you! Two issues: 1) Could you make future pull request off the develop branch? 2) When does glmGamPoi return a Inf theta? What do the other approaches do in those cases? I am hesitant to remove those genes from the output without knowing what genes are affected. Could we override the Inf value with something sensible instead?
Hi Christoph,
I am the author of glmGamPoi
and would be delighted to see this integrated into sctransform
. If @yuhanH is busy, I could also create a new PR to address your comments :)
When does glmGamPoi return a Inf theta? What do the other approaches do in those cases?
glmGamPoi
infers the overdispersion for each gene which is the inverse of your theta
. If there is no evidence that a gene is overdispersed, I assume a Poisson GLM instead of a Gamma-Poisson GLM and set the overdispersion estimate to 0 (ie. theta = Inf
).
I am hesitant to remove those genes from the output without knowing what genes are affected. Could we override the Inf value with something sensible instead?
I fully agree that you shouldn't remove the genes with theta = Inf
, instead one could introduce a minimum overdispersion of 10^-8
(which is the limit from DESeq2
).
Best, Constantin
Thanks all. I agree that excluding genes is not ideal, but we do need to test the effect of replacing Inf with 10e-8 on the regularization. As long as the results look good we can make this change.
Thanks @const-ae for chiming in and developing such a great package!
I'd like to enforce a minimum overdispersion - so whenever glmGamPoi
assumes a Poisson and returns theta = Inf
(overdispersions = 0
) a high finite value for theta
should be used instead. Since the interpretation of theta
is tied to the (expected) mean, I propose theta = m / 1e-5
, where m
could be the average of the expected values (mu), or simply the geometric mean of the observed counts. Here the 1e-5
is based on observations when using sctransform with glm.nb
on the pbmc_10k_v3
data - all but one gene out of 2k have gmean / theta > 1e-5
, so this seems to be a good limit and should not mess up regularization.
Thanks for the kind words :)
I'd like to enforce a minimum overdispersion - so whenever glmGamPoi assumes a Poisson and returns theta = Inf (overdispersions = 0) a high finite value for theta should be used instead
Yes, that sounds like a good idea to me.
Since the interpretation of theta is tied to the (expected) mean, I propose theta = m / 1e-5, where m could be the average of the expected values (mu), or simply the geometric mean of the observed counts. Here the 1e-5 is based on observations when using sctransform with glm.nb on the pbmc_10k_v3 data - all but one gene out of 2k have gmean / theta > 1e-5, so this seems to be a good limit and should not mess up regularization.
I am not too familiar with your exact regularization procedure, but that heuristic seems reasonable to me.
Do you want me to give it a try implement it or do you want to edit the PR yourself?
I'll go ahead and make the necessary edits
I tried this using pbmc3k, (code added below), and think its reasonable. residual variances where highly correlated with/without this modification. this did create some extremely high values of theta (~15,000). perhaps that is the point (no overdispersion), but i do worry a bit about having that level of outlier overwhelm the regularization. i would suggest a slightly different denominator (1e-2 or 1e-3) but as the influence on the final results seem to be minimal, i'd trust christoph's judgement
model_pars <- get_model_pars(genes_step1, bin_size, umi, model_str, cells_step1, method, data_step1, theta_given, show_progress)
inf_genes <- which(model_pars[,1]==Inf)
inf_means <- 10^(genes_log_gmean_step1[names(inf_genes)])
model_pars[inf_genes,1] <- inf_means / 1e-5
Yeah... I'm playing around with some data sets and am starting to think that it's probably better to move away from directly regularizing theta (or log10(theta)), but instead transform to some relative measure of overdispersion before regularization
the problem is not glmGamPoi specific, though, so I'll address that separately
I'll stick to 1e-5 for now
FYI, here are the parameters for pbmc_3k for the following three methods A) poisson + theta.ml B) glm.nb C) glmGamPoi
On a single core, runtimes with default parameters are 54s, 303s, 27s
glmGamPois could be much faster if I called it on blocks of genes and not on individual genes - that's something I want to change
@yuhanH Thank you for this PR! Instead of fixing the conflicts and then adding the changes discussed here, I decided to add things myself in a9e1314
I add glmGamPoi to run negative binomial regression, and it can increase sctransform speed significantly. For the pbmc3k, it decreases the time from 53s to 18s, and residuals don't change. Because the glmGamPoi is not in CRAN, I add glmGamPoi to Suggests in DESCRIPTION