Massive-scale PQL with lmer()?

GabrielHoffman commented 5 months ago

I work in genomics and have over-dispersed count data on a massive scale. With 1M total samples of ~1K observations from ~1K individuals, fitting a negative binomial mixed model with glmer.nb() for at least 20K response variables is intractable, so I'm thinking about faster approximations. In the past I have used a normal approximation with observation weights and lmer() in a different context here. When counts are large, this is a good approximation. In the new case, counts are much smaller and often zero. I thought that glmmPQL() might be a good option. I see that it involves a sequence of calls to lme(), whereas lmer() tends to be faster.

1) Is there a reason it doesn't support lmer()? Could I rewrite it with lmer() and slightly different arguments, or am I missing something?

2) A little bit off track, but I thought I'd ask. PQL supports variance terms that are linear (i.e. $\tau\mu$) and quadratic (i.e. $\tau\mu^2$) functions of the mean. But a negative binomial has variance $\mu + \mu^2/\theta$. Have you seen this form included in the PQL context? Is the obstacle theoretical, or is it a matter of implementing the estimation of $\theta$?

Best, Gabriel

bbolker commented 5 months ago

Before this slips through the cracks.

- You may well be better off switching to glmmTMB for this; glmmTMB is definitely faster for NB fits.
- In principle I don't see any reason you couldn't implement PQL with lmer at its core rather than lme (the algorithm/code is simple enough that it would probably be worth experimenting), although I haven't looked at the details.
- The main reason 'vanilla' PQL wouldn't support the "nbinom2" variance ($\mu + \mu^2/\theta$) is the same as the reason that glm and glmer don't support nbinom2 but require extensions (MASS::glm.nb or lme4::glmer.nb): NB2 is not part of the exponential family unless $\theta$ is fixed, and the standard iteratively reweighted least-squares (or penalized IRLS in the case of lme4) only works for exponential family distributions (and more specifically for the exponential dispersion families, ruling out e.g. Beta distributions). In particular, MASS::glmmPQL is exactly glm() + lme() under the hood, so it can only do what glm can. I don't know if it could easily be extended to substitute MASS::glm.nb instead.
- This is another advantage of glmmTMB (which is built on general MLE lines, not extensions of GLM/IRLS): it allows (much) more flexibility in response distributions, including "nbinom2", "nbinom1" ($V = \phi \mu$), and, more recently, a form combining these two, $V = \mu(a + \mu/b)$, as in DESeq2.
- In the bioinformatics world I feel like people have come up with a lot of speed-optimized NB fitters, although (1) I can't name any off the top of my head (you might poke around on Bioconductor; I don't remember if any of them are in the mixed models task view) and (2) I don't know if they would have the flexibility and all the other features you want.
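
To spell out the exponential-family point, here is a short derivation in the usual NB2 parameterization (mean $\mu$, size $\theta$). The symbols $\eta$, $b$, and $c$ are notation introduced for this sketch, not something from the packages. The log-density is

$$
\log f(y;\mu,\theta) = y\log\frac{\mu}{\mu+\theta} + \theta\log\frac{\theta}{\mu+\theta} + \log\Gamma(y+\theta) - \log\Gamma(\theta) - \log y!
$$

With $\theta$ held fixed this has the form $y\eta - b(\eta) + c(y)$, where

$$
\eta = \log\frac{\mu}{\mu+\theta}, \qquad b(\eta) = -\theta\log\left(1 - e^{\eta}\right), \qquad c(y) = \log\Gamma(y+\theta) - \log\Gamma(\theta) - \log y!
$$

and differentiating the cumulant recovers the mean and the NB2 variance:

$$
b'(\eta) = \frac{\theta e^{\eta}}{1 - e^{\eta}} = \mu, \qquad b''(\eta) = \frac{\theta e^{\eta}}{\left(1 - e^{\eta}\right)^{2}} = \mu + \frac{\mu^{2}}{\theta}
$$

Once $\theta$ is unknown, the $\log\Gamma(y+\theta)$ term couples $y$ and $\theta$ in a way that does not separate into a statistic times a parameter, so IRLS alone cannot estimate it and $\theta$ has to be handled in an outer step.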

GabrielHoffman commented 5 months ago

Thanks for the feedback, this was really helpful. I have a very fast implementation of an LMM for the special case with one random effect that scales to the millions of tests I need to run. So I'm looking at how to adapt that to overdispersed count responses with a PQL approach. I'm happy to share the package when I'm a little farther along.

Do you have a good description of what glmer.nb() is doing in the backend? It seems like it:

1) Fits a Poisson GLMM with a Laplace approximation using glmer()
2) Estimates theta using est_theta()
3) Fits glmer() again, this time with theta fixed

Is this a first-principles approach that would converge if steps 2 and 3 were iterated?

It seems like this logic could be adapted to a Poisson model fit with PQL. Do you have any concerns?

Best, Gabriel

bbolker commented 4 months ago

Sorry not to reply sooner. Yes, glmer.nb is calling optimize() to do a one-dimensional optimization on theta over calls to glmer() with different fixed values of theta. (glmer.nb -> lme4:::optTheta -> lme4:::refitNB).
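
A rough sketch of that profiling idea, assuming simulated toy data and refitting from scratch at each value of theta rather than going through the internal lme4:::refitNB. The names `dat` and `nb_dev`, the data-generating values, and the search interval are all made up for illustration:

```r
library(lme4)
library(MASS)  # negative.binomial() family with fixed theta

set.seed(1)
# Simulated toy data (illustrative only): 20 groups, 15 observations each
n_grp <- 20
n_per <- 15
g  <- factor(rep(seq_len(n_grp), each = n_per))
x  <- rnorm(n_grp * n_per)
b  <- rnorm(n_grp, sd = 0.5)               # group-level random intercepts
mu <- exp(1 + 0.3 * x + b[g])
y  <- rnbinom(length(mu), mu = mu, size = 2)  # true theta = 2
dat <- data.frame(y = y, x = x, g = g)

# -2 * log-likelihood of the GLMM with theta held fixed
nb_dev <- function(theta) {
  fit <- glmer(y ~ x + (1 | g), data = dat,
               family = negative.binomial(theta = theta))
  -2 * as.numeric(logLik(fit))
}

# One-dimensional search on log(theta); the interval is an arbitrary choice
opt <- optimize(function(log_theta) nb_dev(exp(log_theta)),
                interval = c(-1, 4))
theta_hat <- exp(opt$minimum)
theta_hat
```

Each evaluation of `nb_dev()` is a full glmer() fit, which is where the cost comes from when this is repeated over many responses.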

For overdispersed count responses, if you're going to implement via PQL for speed, then I think you get the overdispersion/quasi-likelihood estimation part for free. Assuming that you can embed lmer into the equivalent framework used by MASS::glmmPQL (and thereby speed it up), I don't think you need to do any iteration to get the overdispersion component.
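
To make that concrete, here is a minimal, untuned sketch of a PQL loop for a log-link count model with lmer() as the inner fitter. It follows the usual working-response/working-weight recipe and a glmmPQL-style convergence check. `pql_lmer` and the columns `pql_z` and `pql_w` are hypothetical names, the data are simulated, and it assumes complete data with no offsets or prior weights:

```r
library(lme4)

# Hypothetical helper: PQL with a log link and Poisson-type variance
pql_lmer <- function(formula, data, max_iter = 50, tol = 1e-6) {
  y <- eval(formula[[2]], data)             # observed counts
  # Starting values from a fixed-effects-only Poisson GLM
  fit0 <- glm(lme4::nobars(formula), family = poisson, data = data)
  eta  <- fit0$linear.predictors
  work <- formula
  work[[2]] <- as.name("pql_z")             # swap in the working response
  for (i in seq_len(max_iter)) {
    mu <- exp(eta)
    data$pql_z <- eta + (y - mu) / mu       # working response
    data$pql_w <- mu                        # working (precision) weights
    fit <- lmer(work, data = data, weights = pql_w, REML = FALSE)
    eta_new <- as.numeric(fitted(fit))
    converged <- sum((eta_new - eta)^2) < tol * sum(eta_new^2)
    eta <- eta_new
    if (converged) break
  }
  # With weights = mu, the residual variance plays the role of the
  # dispersion phi in V = phi * mu
  list(fit = fit, iterations = i, dispersion = sigma(fit)^2)
}

set.seed(2)
# Simulated toy data (illustrative only)
g  <- factor(rep(1:20, each = 15))
x  <- rnorm(300)
mu <- exp(1 + 0.3 * x + rnorm(20, sd = 0.5)[g])
dat <- data.frame(y = rnbinom(300, mu = mu, size = 2), x = x, g = g)

res <- pql_lmer(y ~ x + (1 | g), dat)
fixef(res$fit)
res$dispersion
```

The dispersion here comes straight out of the final lmer() fit, with no extra loop. Note that this is the quasi-Poisson form $V = \phi\mu$, not the NB2 form $\mu + \mu^2/\theta$.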

lme4 / lme4

Massive-scale PQL with lmer()? #798