lme4 / lme4

Mixed-effects models in R using S4 classes and methods with RcppEigen

Massive-scale PQL with lmer()? #798

Open GabrielHoffman opened 5 months ago

GabrielHoffman commented 5 months ago

I work in genomics and I have over-dispersed count data on a massive scale. With 1M total samples of ~1K observations from ~1K individuals, fitting a negative binomial mixed model with glmer.nb() for at least 20K response variables is intractable. So I'm thinking about faster approximations. In the past, in a different context, I used a normal approximation with observation weights and lmer() here. When counts are large, this is a good approximation. In the new case, counts are much smaller and often zero. I thought that glmmPQL() might be a good option. I see that it involves a sequence of calls to lme(), while lmer() tends to be faster.

1) Is there a reason glmmPQL() doesn't support lmer()? Could I rewrite it to use lmer() with slightly different arguments, or am I missing something?

2) A little bit off track, but I thought I'd ask. PQL supports variance terms that are linear (i.e. $\tau\mu$) or quadratic (i.e. $\tau\mu^2$) functions of the mean. But a negative binomial has variance $\mu + \mu^2/\theta$. Have you seen this form used in the PQL context? Is the obstacle theoretical, or is it an implementation issue in estimating $\theta$?

Best, Gabriel

bbolker commented 5 months ago

Before this slips through the cracks.

GabrielHoffman commented 5 months ago

Thanks for the feedback, this was really helpful. I have a very fast implementation of an LMM for the special case of one random effect that scales to the millions of tests I need to run. So I'm looking at how to adapt that to overdispersed count responses with a PQL approach. I'm happy to share the package when I'm a little farther along.

Do you have a good description of what glmer.nb() is doing in the backend? It seems like it:

1) fits a Poisson GLMM with a Laplace approximation using glmer();
2) estimates theta using est_theta();
3) fits glmer() again, this time with theta fixed.
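For concreteness, here is a hedged sketch of that three-step logic in plain R. The data names (y, x, g) are placeholders, and I'm using MASS::theta.ml() as a stand-in for lme4's internal theta estimator, so the details may not match what glmer.nb() actually does:

```r
library(lme4)
library(MASS)   # negative.binomial() family constructor and theta.ml()

## 1) Initial Poisson fit via the Laplace approximation
fit0 <- glmer(y ~ x + (1 | g), family = poisson)

## 2) Estimate theta from the fitted means
##    (a stand-in for lme4's internal est_theta())
mu <- fitted(fit0)
th <- theta.ml(y, mu, limit = 25)

## 3) Refit with theta held fixed in the NB family
fit1 <- glmer(y ~ x + (1 | g), family = negative.binomial(theta = th))
```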

Is this a first-principles approach that would converge if steps 2 and 3 were iterated?

It seems like this logic could be adapted to a Poisson model fit with PQL. Do you have any concerns?

Best, Gabriel

bbolker commented 4 months ago

Sorry not to reply sooner. Yes, glmer.nb is calling optimize() to do a one-dimensional optimization on theta over calls to glmer() with different fixed values of theta. (glmer.nb -> lme4:::optTheta -> lme4:::refitNB).
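In other words, something roughly like the following, where nb_dev and the search interval are illustrative, not lme4 internals:

```r
library(lme4)
library(MASS)

## Profile objective: deviance of an NB GLMM at a fixed theta.
## Optimizing on the log scale keeps theta positive.
nb_dev <- function(log_theta, formula, data) {
  fit <- glmer(formula, data = data,
               family = negative.binomial(theta = exp(log_theta)))
  -2 * as.numeric(logLik(fit))
}

## One-dimensional search over theta, each evaluation refitting the GLMM
opt <- optimize(nb_dev, interval = log(c(0.1, 100)),
                formula = y ~ x + (1 | g), data = dat)
theta_hat <- exp(opt$minimum)
```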

For overdispersed count responses, if you're going to implement via PQL for speed, then I think you get the overdispersion/quasi-likelihood estimation part for free. Assuming that you can embed lmer into the equivalent framework used by MASS::glmmPQL (and thereby speed it up), I don't think you need to do any iteration to get the overdispersion component.
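To make the "for free" part concrete, here is a hedged sketch of a bare-bones PQL loop for a Poisson/log model built on lmer() rather than lme(). The data names (y, x, g) are placeholders and convergence checks are omitted; the point is that the residual variance of the final weighted LMM plays the role of the quasi-Poisson dispersion $\phi$:

```r
library(lme4)

eta <- log(y + 0.5)              # crude starting linear predictor
for (i in 1:20) {
  mu <- exp(eta)
  z  <- eta + (y - mu) / mu      # working response (log link: d eta/d mu = 1/mu)
  w  <- mu                       # working weights: 1 / (V(mu) * (d eta/d mu)^2) = mu
  fit <- lmer(z ~ x + (1 | g), weights = w)
  eta <- fitted(fit)
}

## The residual variance of the weighted LMM estimates the dispersion phi,
## so overdispersion comes out of the fit with no extra iteration.
phi_hat <- sigma(fit)^2
```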