Closed EmilHvitfeldt closed 2 years ago
We decided we can't remove the qr or residuals elements because doing so would keep folks from being able to do predict(se.fit = TRUE), which is a pretty common way to predict with glm().
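To make the dependency concrete, here is a minimal base R sketch. It assumes a small simulated data set (the names df, x1, x2 and y are illustrative, not from the issue) and removes the qr element by hand, which is not how butcher itself is implemented:

```r
# Simulated data; names and sizes are illustrative only.
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(200)

fit <- glm(y ~ ., data = df)
new_rows <- df[1:5, ]

# With the full object, predictions come back with standard errors.
pred <- predict(fit, newdata = new_rows, se.fit = TRUE)
names(pred)

# Drop the QR decomposition by hand and try the same call again.
stripped <- fit
stripped$qr <- NULL
result <- tryCatch(
  predict(stripped, newdata = new_rows, se.fit = TRUE),
  error = function(e) e
)

# TRUE: the call fails once the qr element is gone.
inherits(result, "error")
```

The standard errors are computed from the stored QR decomposition, so an axed object without it can no longer serve this kind of prediction.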
Here is what we expect now from the glm method:
library(butcher)
sim_df <- modeldata::sim_regression(num_samples = 1e4)
sim_glm <- glm(outcome ~ ., data = sim_df)
sum(weigh(sim_glm)$size)
#> [1] 10.22058
weigh(sim_glm)
#> # A tibble: 83 × 2
#> object size
#> <chr> <dbl>
#> 1 qr.qr 2.32
#> 2 y 0.721
#> 3 residuals 0.720
#> 4 fitted.values 0.720
#> 5 linear.predictors 0.720
#> 6 weights 0.720
#> 7 prior.weights 0.720
#> 8 effects 0.162
#> 9 model.outcome 0.0800
#> 10 model.predictor_01 0.0800
#> # … with 73 more rows
butchered <- butcher(sim_glm)
sum(weigh(butchered)$size)
#> [1] 7.094472
weigh(butchered)
#> # A tibble: 63 × 2
#> object size
#> <chr> <dbl>
#> 1 qr.qr 2.32
#> 2 residuals 0.720
#> 3 linear.predictors 0.720
#> 4 weights 0.720
#> 5 prior.weights 0.720
#> 6 effects 0.162
#> 7 model.outcome 0.0800
#> 8 model.predictor_01 0.0800
#> 9 model.predictor_02 0.0800
#> 10 model.predictor_03 0.0800
#> # … with 53 more rows
Created on 2022-08-30 with reprex v2.0.2
This is how glm() itself works 😔 and isn't due to using tidymodels:
sim_df <- modeldata::sim_regression(num_samples = 1e4)
sim_glm <- glm(outcome ~ ., data = sim_df)
lobstr::obj_size(sim_df)
#> 1.68 MB
lobstr::obj_size(sim_glm)
#> 5.77 MB
Created on 2022-08-30 with reprex v2.0.2
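For readers without modeldata or lobstr installed, roughly the same comparison can be sketched in base R. This assumes simulated data of a similar shape (1e4 rows, 20 predictors) and uses object.size(), which counts shared components differently from lobstr::obj_size(), so the numbers will not match the reprex above. The model = FALSE and y = FALSE arguments are standard glm() arguments, shown only to illustrate where some of the size comes from; they are not what butcher does.

```r
# Simulated stand-in for the data in the reprex; column names are assumptions.
set.seed(1)
n <- 1e4
df <- as.data.frame(matrix(rnorm(n * 20), ncol = 20))
df$outcome <- rowSums(df[, 1:3]) + rnorm(n)

# Default fit keeps the model frame and the response inside the object.
full <- glm(outcome ~ ., data = df)

# Ask glm() not to store the model frame or the response.
slim <- glm(outcome ~ ., data = df, model = FALSE, y = FALSE)

size_data <- as.numeric(object.size(df))
size_full <- as.numeric(object.size(full))
size_slim <- as.numeric(object.size(slim))

# Per-element sizes of the default fit, in MB, largest first.
elem_mb <- sort(
  sapply(full, function(el) as.numeric(object.size(el))),
  decreasing = TRUE
) / 1e6
head(round(elem_mb, 2))
```

The fitted object carries several vectors of length n plus an n-row matrix in qr, which is why it ends up larger than the data it was fit on. Functions that rely on the dropped components may stop working on the slim fit.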
If you have very wide data, you might consider the "bad parts" of the R formula and use update_role() instead of the formula interface in tidymodels.
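As a rough sketch of that suggestion, a recipe can be created from the data alone and the roles assigned afterwards. The data frame wide_df and its column names are made up for illustration, and this only shows the role assignment, not a full workflow:

```r
library(recipes)

# Hypothetical wide data: 50 predictors plus an outcome column.
set.seed(1)
wide_df <- as.data.frame(matrix(rnorm(100 * 50), ncol = 50))
wide_df$outcome <- rnorm(100)

# No formula: start from the data and assign roles explicitly.
rec <- recipe(wide_df) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(outcome, new_role = "outcome")

# Count how many variables ended up in each role.
table(summary(rec)$role)
```

This sidesteps the formula when specifying a wide model, but it does not change what the glm engine stores in the fitted object.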
Thanks for looking into this so thoroughly.
My data is wide, several hundred columns. I tried switching from using a formula to update_role() ... the file size remains quite large, but as you point out, that's just an outcome of using glm(). I'll experiment with other engines.
Thanks again!
Really wide data can be tough for many kinds of models including tree-based models. I think I'd look at something like lasso or ridge regularization in that situation, either with glmnet or LiblineaR. LiblineaR is really fast.
Good luck!
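For anyone wanting to try the regularization route, a minimal lasso sketch with glmnet might look like the following. The data are simulated and the penalty value s = 0.1 is an arbitrary choice for illustration; in practice it would be tuned, for example with cv.glmnet().

```r
library(glmnet)

# Simulated wide data: more predictors than rows. Sizes are illustrative.
set.seed(1)
n <- 200
p <- 300
x <- matrix(rnorm(n * p), nrow = n)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

# alpha = 1 is the lasso penalty; alpha = 0 would give ridge.
lasso_fit <- glmnet(x, y, alpha = 1)

# Predict for a few rows at one (arbitrarily chosen) penalty value.
preds <- predict(lasso_fit, newx = x[1:5, ], s = 0.1)

# Number of non-zero coefficients (including the intercept) at that penalty.
n_nonzero <- sum(as.matrix(coef(lasso_fit, s = 0.1)) != 0)
```

glmnet takes a numeric matrix rather than a formula, which also avoids the formula overhead mentioned earlier for wide data.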
I'm no longer able to reproduce the results shown in https://github.com/tidymodels/butcher/pull/212, despite no changes in the glm.R file. I found this problem when trying to answer this SO question: https://stackoverflow.com/questions/73529453/file-size-of-tidymodels-workflow
Created on 2022-08-29 by the reprex package (v2.0.1)