marangiop opened this issue 2 days ago
Hello,

I am working with a large dataset spread over 300 files, each of which I can load into R as a tibble. After scaling some of the columns, I use glmer to fit the model as below:

withkin_basic2 <- glmer(withkin_ind ~ (1 | combined_group) + (1 | SAMPLE),
                        na.action = na.omit, family = binomial(link = "logit"),
                        weights = weight, data = elder_clean)

Since I am not able to load the 300 files into memory as a single tibble, I would like to fit the model in a parallel fashion, namely using distributed-memory computing (just as @bbolker suggested in the StackOverflow thread below, from 2015). I have been googling this topic and have come across some GitHub issues and StackOverflow threads related to LMMs and GLMMs. Some of them are quite old, and I haven't been able to find any clear code example in R or Python detailing a potential solution for fitting a GLMM in a distributed-memory fashion.

Resources

https://stackoverflow.com/questions/31452801/fitting-a-linear-mixed-model-to-a-very-large-data-set
https://github.com/lme4/lme4/issues/387
https://github.com/fulyagokalp/lmmpar
https://github.com/linkedin/photon-ml

Can somebody please advise? Is photon-ml the solution?
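The per-file loading step described at the top of the question is not shown above; a minimal sketch of what it could look like, assuming CSV shards. The file name, the readr/dplyr calls, and the columns chosen for scaling (age, income) are illustrative placeholders; only the glmer() call comes from the post.

```r
## Hypothetical per-file workflow: load one shard as a tibble, scale
## some columns, fit the GLMM. "shard_001.csv", age, and income are
## made-up placeholders, not names from the original data.
library(readr)
library(dplyr)
library(lme4)

elder_clean <- read_csv("shard_001.csv") |>
  mutate(across(c(age, income), ~ as.numeric(scale(.x))))

withkin_basic2 <- glmer(
  withkin_ind ~ (1 | combined_group) + (1 | SAMPLE),
  na.action = na.omit,
  family = binomial(link = "logit"),
  weights = weight,
  data = elder_clean
)
```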
I don't know the distributed-memory systems well (barely at all). lmmpar doesn't look useful, at least as currently written (it does not do distributed-memory computing). photon-ml looks like it might work, although from a quick look at its introductory materials I can't tell whether it actually does shrinkage estimation for what it calls 'random effects' (the presentation doesn't go into enough technical detail and I don't want to dig deeper). sparklyr might be the most promising route; if you want to go that way and want help building the solution for yourself (starting here, I think), I will help you. (I could write an example that ran [for a model where the data did fit in memory] in parallel across shards, if you were willing to do the work of adapting it to Spark ...)
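A minimal sketch of the kind of shard-parallel example offered above, under the assumption that any single file fits in memory on its own: fit the same GLMM independently on each of the 300 shards with parallel::mclapply(), then pool the fixed-effect estimates by inverse-variance weighting. The model formula comes from the question; the file pattern, core count, and pooling rule are illustrative assumptions, not part of the original offer.

```r
## Sketch: fit the same GLMM on every shard in parallel, then pool the
## fixed effects across shards by inverse-variance weighting (a simple
## meta-analysis style combination; an illustrative choice, not an
## exact substitute for fitting one model to all of the data at once).
library(lme4)
library(parallel)

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # the 300 shards

fit_one <- function(path) {
  shard <- read.csv(path)
  glmer(withkin_ind ~ (1 | combined_group) + (1 | SAMPLE),
        na.action = na.omit,
        family = binomial(link = "logit"),
        weights = weight,
        data = shard)
}

## mclapply forks worker processes; on Windows it needs mc.cores = 1.
fits <- mclapply(files, fit_one, mc.cores = 8)

## Pool each fixed effect: weight each shard's estimate by 1 / SE^2.
est <- do.call(cbind, lapply(fits, fixef))                            # p x 300
se  <- do.call(cbind, lapply(fits, function(f) sqrt(diag(vcov(f)))))
w   <- 1 / se^2
pooled <- data.frame(estimate = rowSums(est * w) / rowSums(w),
                     se       = sqrt(1 / rowSums(w)))
pooled
```

Because each shard fit is independent, the same pattern could in principle be pushed onto Spark executors (for example via sparklyr::spark_apply()). Note that pooling of this kind ignores between-shard differences in the estimated random-effect variances, which is part of why adapting it properly to Spark is the real work.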
Just to clarify, I am not doing a PhD :)