m-clark / mixed-models-with-R

Covers the basics of mixed models, mostly using @lme4
https://m-clark.github.io/mixed-models-with-R/

Any idea if bigger data can be used with Julia (or Python) vs R? #24

Closed · AdrianAntico closed this issue 2 years ago

AdrianAntico commented 2 years ago

Great book! Just finished it up. In the appendix you mentioned that Python and Julia offer a subset of the modeling options that R does. You also mentioned that you've been able to build some models in R with over a million records. Do you know whether Julia (or Python) can handle larger amounts of data than R for these types of models, or whether they offer any run-time advantages? I would like to think Julia could, but I'm not sure what the bottleneck is in training these models or to what extent there are multi-threading opportunities... Again, great book!

m-clark commented 2 years ago

Thanks, and glad you enjoyed it!

I haven't played with Julia's MixedModels recently, but I do follow its development and know that it is still actively developed (by one of the primary authors of R's lme4). I'd be surprised if it had a notable speed gain, and if it did, I'd wonder why the same improvements wouldn't also make their way into lme4 given the development entanglement, but that is based on very dated knowledge at this point. From my recollection of lme4 issues/development, some of the problem parts are difficult to parallelize, and beyond that, very little of lme4 is even written in R to begin with, so aside from parallelization it's not going to get much faster than the underlying compiled code and the computational tricks employed there.
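For a concrete point of reference, here is a minimal sketch of the kind of fit being discussed: two crossed random intercepts in lme4 on simulated data, timed with system.time(). The sizes and effect values are made up for illustration, not from any actual benchmark.

```r
# Hypothetical benchmark setup: simulate data with two crossed grouping factors,
# then time a standard lmer() fit. Sizes are arbitrary, chosen only for illustration.
library(lme4)

set.seed(123)
n   <- 1e6     # observations
ng1 <- 5000    # levels of first grouping factor
ng2 <- 500     # levels of second grouping factor

g1 <- sample(ng1, n, replace = TRUE)
g2 <- sample(ng2, n, replace = TRUE)
b1 <- rnorm(ng1, sd = 1)     # true random intercepts for g1
b2 <- rnorm(ng2, sd = 0.5)   # true random intercepts for g2

d <- data.frame(
  x  = rnorm(n),
  g1 = factor(g1),
  g2 = factor(g2)
)
d$y <- 2 + 0.5 * d$x + b1[g1] + b2[g2] + rnorm(n)

# Two crossed random intercepts; the heavy lifting happens in lme4's compiled code.
system.time(
  fit <- lmer(y ~ x + (1 | g1) + (1 | g2), data = d)
)
```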

For Python I'm mostly aware of the statsmodels implementation (my former boss contributed a lot to it, but now seems to be playing with Julia 😄). I'm not aware of any speed advantage of the statsmodels implementations over R, and I want to say that in previous testing (also long ago now) it actually seemed a bit slower.

Other alternatives would be something like doing a GLM in pytorch or similar, the main difference being that the fixed effects are penalized by default as well, though you could fiddle with that. The larger issue would be doing anything with the model after the fact, since you would lack all the easy diagnostic and model exploration tools, but that too could be overcome, and it may be worth it if the data is extremely large.
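To make the penalized-GLM idea concrete in R terms (glmnet standing in for pytorch here; this is just a sketch of the concept, not the approach above), random intercepts behave roughly like ridge-penalized indicator columns, and glmnet's penalty.factor can leave the fixed effects unpenalized. The data frame d and the lambda value below are assumptions carried over from the simulated example earlier.

```r
# Sketch of the penalized-GLM alternative using glmnet instead of pytorch.
# Random-effect levels enter as sparse indicator columns with a ridge penalty;
# penalty.factor = 0 exempts the fixed effect from shrinkage.
library(glmnet)
library(Matrix)

# d is the simulated data frame from the earlier sketch (hypothetical).
X <- sparse.model.matrix(~ x + g1 + g2, data = d)[, -1]  # drop the intercept column

# Penalize only the group-indicator columns, not the fixed effect x.
pf <- ifelse(grepl("^g1|^g2", colnames(X)), 1, 0)

# alpha = 0 gives a ridge penalty; lambda here is arbitrary. In a true mixed model
# the amount of shrinkage is tied to the estimated variance components instead.
fit <- glmnet(X, d$y, alpha = 0, penalty.factor = pf, lambda = 0.01)
```

In the penalized least squares view of an LMM, the ridge penalty on the level indicators corresponds roughly to the ratio of the residual variance to the random effect variance, which a mixed model estimates for you; with this kind of approach you'd have to pick or tune it.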

AdrianAntico commented 2 years ago

Thanks for the responses! Given the data size limitations, I assume sampling is a natural next step. Do you have any advice or resources on optimal sampling strategies for these types of models? For example, would it be better to sample a list of levels from the random effects and make sure to include the full history for those levels, or to run some sort of stratified sampling over the random effect levels?

m-clark commented 2 years ago

I don't have any real intuition there, though in the past I think I have sampled so as to include all levels, since depending on the tool used, problems may arise when some levels are missing, but that's not necessarily a good reason. I'd maybe suggest consulting someone more well versed in survey-sampling types of approaches for an alternative view.
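For what it's worth, here's a minimal sketch of the two strategies you describe, assuming a data frame d with a random-effect factor g1; the level count and sampling fraction are arbitrary.

```r
# Two hypothetical sampling strategies for a data frame d with grouping factor g1.
library(dplyr)

# (a) Sample a subset of levels and keep the full history for each sampled level.
keep_levels <- sample(levels(d$g1), size = 1000)   # number of levels is arbitrary
d_by_level  <- d %>% filter(g1 %in% keep_levels)

# (b) Stratified sampling: keep every level, but downsample rows within each one.
d_stratified <- d %>%
  group_by(g1) %>%
  slice_sample(prop = 0.10) %>%   # keep roughly 10% of rows per level
  ungroup()
```

The rough trade-off is that (a) estimates the variance component from fewer levels while keeping within-level information intact, whereas (b) keeps all levels (so nothing is new at prediction time) but thins the information within each level; which matters more will depend on the model and the tool.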

But I came across this yesterday regarding Julia's MixedModels, which sounds great, so maybe try it first! Though I'm not entirely convinced it's faster than lme4, and I'd be concerned about one tool converging where another doesn't, it sounds like Julia worked well in their very large situation (millions of records with at least two random effects).

m-clark commented 2 years ago

> Thanks for the link. I'll have to test both out and compare. Do you know if there is a way to run the models and have them calculate only what's necessary for the random effects estimates?

Not that I know of. If there is a hack, you might contact Bolker of lme4; he's pretty good about responding on SO and GitHub.

AdrianAntico commented 2 years ago

@m-clark I figured out a method for generating predictions for a couple of model structures so far. The biggest data I was able to use, given a 256GB memory limit, was a 500M-record data set with two random effects: one with 50M levels and the other with 5M levels. It took 3.9 minutes to run and return the predictions. Do you think there would be some interest in that?

AdrianAntico commented 2 years ago

@m-clark I went ahead and submitted an issue to the lme4 github repo and tagged Bolker. I'll keep you posted about any responses, if you're interested.

m-clark commented 2 years ago

Definitely keep me posted, sounds great!

AdrianAntico commented 2 years ago

Here's a link to the issue. I shared the idea with some supporting materials.

https://github.com/lme4/lme4/issues/696

bbolker commented 2 years ago

I am interested in issues of performance on large data sets (despite my lack of interest in incorporating the particular proposal in the lme4 package). I would be happy to work on developing a list of references/pointers to go in the GLMM FAQ if that is an appropriate venue, or elsewhere ...

From the questions above it seems you are most interested in fast fitting/inference on in-memory data, rather than recipes for handling out-of-memory data (map/reduce etc.), which typically come with a corresponding cost in time ...

Some possibly interesting references:


Gao, Katelyn. “Scalable Estimation and Inference for Massive Linear Mixed Models with Crossed Random Effects.” PhD Thesis, Stanford University, 2017. https://statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdf.

Gao, Katelyn, and Art Owen. “Efficient Moment Calculations for Variance Components in Large Unbalanced Crossed Random Effects Models.” Electronic Journal of Statistics 11, no. 1 (2017): 1235–96. https://doi.org/10.1214/17-EJS1236.

Gao, Katelyn, and Art B. Owen. “Estimation and Inference for Very Large Linear Mixed Effects Models.” Statistica Sinica, 2020. https://doi.org/10.5705/ss.202018.0029.

Stitch Fix Technology. “Diamond: Python Solver for Mixed-Effects Models.” 2017. https://github.com/stitchfix/diamond.

Sweetser, Tim, and Aaron Bradley. “Diamond Part II.” Stitch Fix Technology: Multithreaded, August 7, 2017. https://multithreaded.stitchfix.com/blog/2017/08/07/diamond2/.

AdrianAntico commented 2 years ago

My thinking is that fast inference could open doors for mixed effects frameworks that don't exist today. Consider the "Follow the Regularized Leader" (FTRL) modeling framework that Google and others use. I would think mixed effects approaches could be beneficial to systems like these, and considering how much revenue is generated in the world of ad tech, I would think there would be some research dollars behind exploring these methods (assuming they haven't already been explored).

The distributed computing aspect sounds interesting. I would think something like that would be needed to run a model such as predicting the effects of COVID vaccinations across a population.

Anyway, thanks for the references. I'm going to dig into them and see what kind of performance improvements I can squeeze out of them.