vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference.
https://marginaleffects.com

Reduce model object size without breaking marginaleffects functionality #1198

Closed RoelVerbelen closed 3 weeks ago

RoelVerbelen commented 1 month ago

I'm looking for some guidance on how to reduce a regression model's object size without impairing marginaleffects. Regression models in R tend to store the entire modelling data set as a list element in the fitted object. When working with large data sets, that makes saving and loading (many iterations of) these models inefficient.

I've created a minimal reprex to illustrate how trying to limit the object size of regression models can lead to errors within marginaleffects, even though the modelling data set is not strictly necessary for the post-fit estimation analysis.

library(dplyr)
library(marginaleffects)
library(mgcv)
library(butcher)

reduce_glm_size <- function(object) {
  # Maintain the structure of the modelling data set but not all records
  object$model <- object$model[1, ]
  object
}

diamonds <- ggplot2::diamonds
model_data <- diamonds
fit <- gam(price ~ cut, data = model_data)
# Mimic loading the fitted model in new session such that the
# modelling data no longer exists with that name in the environment
rm(model_data)

avg_predictions(fit, newdata = diamonds[1:2, ], variables = as.list(distinct(diamonds, cut)))
#> 
#>        cut Estimate Std. Error     z Pr(>|z|)   S 2.5 % 97.5 %
#>  Ideal         3458       27.0 128.1   <0.001 Inf  3405   3510
#>  Premium       4584       33.8 135.8   <0.001 Inf  4518   4650
#>  Good          3929       56.6  69.4   <0.001 Inf  3818   4040
#>  Very Good     3982       36.1 110.4   <0.001 Inf  3911   4052
#>  Fair          4359       98.8  44.1   <0.001 Inf  4165   4552
#> 
#> Columns: cut, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

reduced_fit <- reduce_glm_size(fit)
avg_predictions(reduced_fit, newdata = diamonds[1:2, ], variables = as.list(distinct(diamonds, cut)))
#> Error: Some elements of the `variables` argument are not in their original
#>   data. Check this variable: cut

# sanitize_variables() within predictions() throws an error because of
# !all(as.character(predictors[[v]]) %in% as.character(modeldata[[v]]))

axed_fit <- axe_data(fit)
avg_predictions(axed_fit, newdata = diamonds[1:2, ], variables = as.list(distinct(diamonds, cut)))
#> Error in evalup(model[["call"]][["data"]]): object 'model_data' not found

# get_modeldata() within predictions() throws an error because 
# fit[["call"]][["data"]] is not found

vincentarelbundock commented 4 weeks ago

One way for you to investigate this is to make sure that your model object still works with all the functions from the insight package that marginaleffects calls, for example insight::get_data(), insight::find_variables(), etc.
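
For instance, a quick check along these lines (just a sketch; the exact set of insight functions involved can vary by model class) can tell you whether the reduced object still works:

# Check that the reduced object still works with the insight functions that
# marginaleffects relies on to recover the modelling data and variables
str(insight::get_data(reduced_fit))   # should recover the original modelling data
insight::find_variables(reduced_fit)  # should list the response and predictors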

RoelVerbelen commented 4 weeks ago

Thanks Vincent. The only way I see for insight::get_data() to keep working as we need it to is for the data frame to still exist in the global environment, under the same name it was referred to by during model fitting. Relying on that across different R sessions - one where the model is fitted and one where it is evaluated - is not very robust (see above, where I explicitly did rm(model_data) to mimic this).

However, marginaleffects doesn't strictly need that data in order to do the post-estimation predictions. It only uses it to validate the argument inputs (here: whether all the factor values are observed in the model data). Ideally, I'd like to find a way to bypass that validation check so I can keep the model objects small. Perhaps by introducing a marginaleffects function argument or package option?

vincentarelbundock commented 4 weeks ago

@RoelVerbelen

I see why this would be useful.

I would be open to merging a PR which looks in ... for something named modeldata that pre-empts the need to call insight::get_data(). But to be frank, this is very low priority for me, and I will not work on it unless the PR is very close to finished, including coverage for all functions and some tests.

Also, I would only merge something like this if it requires very few lines of code; I'm not willing to accept a lot of added code complexity. There's a chance this is much more complicated than we think, since insight::find_variables() may also rely on insight::get_data(). In that case, a simple PR may not do it.
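
In pseudocode, the kind of check I have in mind might look roughly like this (a sketch only; the helper name and placement are made up, not actual marginaleffects internals):

# Hypothetical sketch: look in ... for a user-supplied modeldata before
# falling back to insight::get_data()
get_modeldata_or_user <- function(model, ...) {
  dots <- list(...)
  if (!is.null(dots[["modeldata"]])) {
    return(dots[["modeldata"]])
  }
  insight::get_data(model)
}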

Sorry to be so blunt. I see why this might be useful, but I want to be fully transparent about the parameters for a potential contribution.

vincentarelbundock commented 3 weeks ago

Closing to keep the issue tracker manageable, and because I don't intend to work on this in the short run. But, as stated, I am quite open to the idea, given an adequate implementation.

I listed this in the master thread for good ideas with no immediate fixes.

Thanks for raising!

RoelVerbelen commented 3 weeks ago

Thanks for considering it further down the road, @vincentarelbundock. In the short term, I'll rely on loading the modelling data under the same name it had when the model was fitted, so that insight::get_data() picks it up from the environment.
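
Concretely, something like this (a sketch, continuing the reprex above; the name model_data matches the one used at fitting time):

# Workaround sketch: in the scoring session, recreate the modelling data under
# the same name used in the original gam() call, so insight::get_data() can
# find it when evaluating fit$call$data
model_data <- ggplot2::diamonds
avg_predictions(axed_fit, newdata = model_data[1:2, ],
                variables = as.list(dplyr::distinct(model_data, cut)))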

vincentarelbundock commented 3 weeks ago

Thanks for the note. This sounds like a useful hack.