Data_grid() passing a massive dataset to add_epred_draws()

mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)

http://mjskay.github.io/tidybayes

GNU General Public License v3.0

710 stars 59 forks

Data_grid() passing a massive dataset to add_epred_draws() #319

Closed Erin-Fedewa-NOAA closed 3 months ago

Erin-Fedewa-NOAA commented 3 months ago

I'm not entirely sure whether this warrants an issue or is simply user error, but I always hit a roadblock when trying to pass a data_grid() prediction grid to add_predicted_draws() or add_epred_draws(). If I have multiple population-level effects in a model, I get errors if I don't include all the covariates in the data_grid() call, and when I do include them, I am stuck passing a massive dataset to the tidybayes functions. Just curious if there's a workaround for defining a sequence range for just one covariate in the data_grid() call without the tidybayes functions throwing an error?

    dat %>%
      data_grid(covariate1, covariate2, covariate3, covariate4) %>% # yikes, 153,754,848 rows!
      add_epred_draws(brmsmodoutput, re_formula = NA) # this is generally where R crash and burns

mjskay commented 3 months ago

Yeah, if you have a huge prediction grid, the output of the function is necessarily going to be 153,754,848 rows × however many draws are in the model, i.e. some huge long-format data frame.
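To put a rough number on that, here is a back-of-the-envelope sketch. The draw count is an assumption (4,000 is what brms produces with its default 4 chains and 1,000 post-warmup iterations each; the thread does not say how many draws this model has), and the memory figure counts only a single double-precision column, so the real data frame would be larger.

```r
grid_rows <- 153754848  # rows reported in the thread for the full data_grid() output
n_draws   <- 4000       # ASSUMPTION: brms defaults, 4 chains x 1000 post-warmup draws

# add_epred_draws() returns one row per grid row per draw
long_rows <- grid_rows * n_draws

# Memory for one double column (8 bytes per value), in decimal terabytes
epred_tb <- long_rows * 8 / 1e12

format(long_rows, big.mark = ",", scientific = FALSE)
round(epred_tb, 2)
```

Under that assumption the long-format output would have about 615 billion rows, far beyond what fits in memory on an ordinary machine, which is consistent with R crashing at that step.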

However, if you are just creating the huge long-format data frame as an intermediate step (e.g. you are summarizing it down later), one way to solve this is to split up the input prediction grid into chunks, pass each chunk to add_epred_draws(), do the summarization on that chunk, and then combine the summaries. This avoids ever needing to create the huge long-format table.

If you really want to get fancy, you can set up a pipeline to do this in parallel using {targets}.

Erin-Fedewa-NOAA commented 3 months ago

Thank you @mjskay! Just to clarify: when you say "split up the input prediction grid", you are referring to splitting up the brms model output object, correct? (i.e. brmsmodoutput in my example above). I was looking for a vignette demonstrating this, but I'll play around and see if I can figure it out.

mjskay commented 3 months ago

Ah, I meant splitting up the output of data_grid().

E.g., say you have three covariates a, b, c: you can do data_grid(a = a1, b, c), pipe that to add_epred_draws(), and do some summarization, then do the same with a2, etc. (not manually, though: use a loop or lapply or map or what have you). If that works, you can do fancier stuff from there, like parallelizing it or using {targets}.
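A minimal sketch of that chunk-and-summarise loop follows. The helper name summarise_epred_by_chunk() and its draw_fun argument are hypothetical and not part of tidybayes; draw_fun is exposed only so the helper can be tried with a stand-in before running it on a real model. The sketch assumes the summary you want is a per-grid-row mean and 95% interval of .epred. It also splits an already-built grid on one covariate, which is a slight variant of the suggestion above: if the full grid is itself too large to hold, build each slice inside the loop with data_grid(a = a1, b, c) instead.

```r
library(dplyr)

# Hypothetical helper: draw and summarise one slice of the prediction grid at a
# time, so the full long-format table of draws never exists in memory.
# Extra arguments in `...` (e.g. re_formula = NA) are passed on to `draw_fun`.
summarise_epred_by_chunk <- function(model, grid, chunk_var, ...,
                                     draw_fun = tidybayes::add_epred_draws) {
  # One data frame per distinct value of the chunking covariate
  chunks <- split(grid, grid[[chunk_var]])

  summarise_one <- function(chunk, ...) {
    # One row per grid row per draw, but only for this chunk
    draws <- draw_fun(chunk, model, ...)
    # add_epred_draws() output is grouped by grid row, so this collapses the
    # draws to one summary row per grid row
    summarise(
      draws,
      epred_mean  = mean(.epred),
      epred_lower = quantile(.epred, 0.025),
      epred_upper = quantile(.epred, 0.975),
      .groups = "drop"
    )
  }

  # Note: .row restarts at 1 within each chunk
  bind_rows(lapply(chunks, summarise_one, ...))
}

# Usage with the names from this thread (not run here):
# grid <- modelr::data_grid(dat, covariate1, covariate2, covariate3, covariate4)
# summarise_epred_by_chunk(brmsmodoutput, grid, "covariate1", re_formula = NA)
```

The result has one row per grid row rather than one per grid row per draw, so it is smaller than the long-format output by a factor equal to the number of draws. Because each chunk is processed independently, the lapply() call is also the natural place to swap in a parallel map later.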

Erin-Fedewa-NOAA commented 3 months ago

Got it, thank you! Feel free to close this issue, I think I've got it from here