tidymodels / butcher

Reduce the size of model objects saved to disk
https://butcher.tidymodels.org/

glm method appears broken on dev #233

Closed EmilHvitfeldt closed 2 years ago

EmilHvitfeldt commented 2 years ago

I'm no longer able to reproduce the results shown in https://github.com/tidymodels/butcher/pull/212, despite no changes in the glm.R file.

I found this problem while trying to answer this Stack Overflow question: https://stackoverflow.com/questions/73529453/file-size-of-tidymodels-workflow

```r
library(butcher)

more_cars <- mtcars[rep(1:32, each = 1000),]
cars_glm <- glm(mpg ~ ., data = more_cars)
weigh(cars_glm)
#> # A tibble: 63 × 2
#>    object             size
#>    <chr>             <dbl>
#>  1 qr.qr             5.36 
#>  2 y                 2.80 
#>  3 residuals         2.80 
#>  4 fitted.values     2.80 
#>  5 linear.predictors 2.80 
#>  6 weights           2.80 
#>  7 prior.weights     2.80 
#>  8 effects           0.513
#>  9 model.mpg         0.256
#> 10 model.cyl         0.256
#> # … with 53 more rows
#> # ℹ Use `print(n = ...)` to see more rows

butchered <- butcher(cars_glm)
sum(weigh(cars_glm)$size)
#> [1] 28.325
sum(weigh(butchered)$size)
#> [1] 19.91117
weigh(butchered)
#> # A tibble: 53 × 2
#>    object             size
#>    <chr>             <dbl>
#>  1 qr.qr             5.36 
#>  2 residuals         2.80 
#>  3 linear.predictors 2.80 
#>  4 weights           2.80 
#>  5 prior.weights     2.80 
#>  6 effects           0.513
#>  7 model.mpg         0.256
#>  8 model.cyl         0.256
#>  9 model.disp        0.256
#> 10 model.hp          0.256
#> # … with 43 more rows
#> # ℹ Use `print(n = ...)` to see more rows
```

Created on 2022-08-29 by the reprex package (v2.0.1)

juliasilge commented 2 years ago

We decided we can't remove the qr or residuals elements, because removing them keeps folks from being able to use predict(se.fit = TRUE), which is a pretty common way to predict with glm().
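To make the dependency concrete, here is a minimal base R sketch (not butcher's code, and not the actual internals of predict()) that rebuilds the standard errors by hand for a small Gaussian, full-rank fit. The names `new_cars`, `xtx_inv`, and `se_manual` are made up for illustration. The point is only that the calculation reads the stored `qr`, `residuals`, and `weights` elements:

```r
fit <- glm(mpg ~ wt + hp, data = mtcars)
new_cars <- mtcars[1:5, ]

# Standard errors as reported by predict()
pred <- predict(fit, newdata = new_cars, se.fit = TRUE)

# Manual version, assuming a Gaussian family and no dropped (aliased) columns
X <- model.matrix(delete.response(terms(fit)), new_cars)
R <- qr.R(fit$qr)                # uses the stored qr element
xtx_inv <- chol2inv(R)           # inverse of the weighted cross-product matrix

# Dispersion estimate: uses the stored residuals and weights
dispersion <- sum(fit$weights * fit$residuals^2) / fit$df.residual

se_manual <- sqrt(rowSums((X %*% xtx_inv) * X) * dispersion)

# The two sets of standard errors should agree up to numerical tolerance
all.equal(unname(pred$se.fit), unname(se_manual))
```

If either `qr` or `residuals` were axed, there would be nothing left to compute `R` or `dispersion` from.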

Here is what we expect now from the glm method:

```r
library(butcher)
sim_df <- modeldata::sim_regression(num_samples = 1e4)
sim_glm <- glm(outcome ~ ., data = sim_df)
sum(weigh(sim_glm)$size)
#> [1] 10.22058
weigh(sim_glm)
#> # A tibble: 83 × 2
#>    object               size
#>    <chr>               <dbl>
#>  1 qr.qr              2.32  
#>  2 y                  0.721 
#>  3 residuals          0.720 
#>  4 fitted.values      0.720 
#>  5 linear.predictors  0.720 
#>  6 weights            0.720 
#>  7 prior.weights      0.720 
#>  8 effects            0.162 
#>  9 model.outcome      0.0800
#> 10 model.predictor_01 0.0800
#> # … with 73 more rows

butchered <- butcher(sim_glm)
sum(weigh(butchered)$size)
#> [1] 7.094472
weigh(butchered)
#> # A tibble: 63 × 2
#>    object               size
#>    <chr>               <dbl>
#>  1 qr.qr              2.32  
#>  2 residuals          0.720 
#>  3 linear.predictors  0.720 
#>  4 weights            0.720 
#>  5 prior.weights      0.720 
#>  6 effects            0.162 
#>  7 model.outcome      0.0800
#>  8 model.predictor_01 0.0800
#>  9 model.predictor_02 0.0800
#> 10 model.predictor_03 0.0800
#> # … with 53 more rows
```

Created on 2022-08-30 with reprex v2.0.2

This is how glm() itself works 😔 and isn't due to using tidymodels:

```r
sim_df <- modeldata::sim_regression(num_samples = 1e4)
sim_glm <- glm(outcome ~ ., data = sim_df)
lobstr::obj_size(sim_df)
#> 1.68 MB
lobstr::obj_size(sim_glm)
#> 5.77 MB
```

Created on 2022-08-30 with reprex v2.0.2
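For anyone without modeldata or lobstr installed, the same comparison can be sketched in base R. This is an approximation under stated assumptions: the simulated data frame below only stands in for `sim_regression()`, and `object.size()` counts shared memory differently from `lobstr::obj_size()`, so the numbers will not match the ones above. `component_bytes` is a name made up for this sketch.

```r
# Simulated stand-in for modeldata::sim_regression(): 20 predictors plus an outcome
set.seed(123)
n <- 1e4
sim_df <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
sim_df$outcome <- rnorm(n)

sim_glm <- glm(outcome ~ ., data = sim_df)

# Size in bytes of each top-level element of the fitted object, largest first
component_bytes <- sort(
  vapply(sim_glm, function(x) as.numeric(object.size(x)), numeric(1)),
  decreasing = TRUE
)

# Largest elements, in MB
head(round(component_bytes / 1e6, 2), 8)

# Whole model object versus the data it was fit on
object.size(sim_df)
object.size(sim_glm)
```

The elements near the top of the list are the ones that grow with the number of rows, which is why the fitted object ends up larger than the training data.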

If you have very wide data, you might consider the "bad parts" of the R formula and use update_role() instead of the formula interface in tidymodels.
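A minimal sketch of what the update_role() route can look like, assuming the recipes package is installed. The data frame `wide_df` and its column names are invented for illustration; only `recipe()`, `update_role()`, and `summary()` are real recipes functions.

```r
library(recipes)

# Made-up wide data: 50 predictor columns plus an outcome
set.seed(123)
wide_df <- as.data.frame(matrix(rnorm(100 * 50), nrow = 100))
wide_df$outcome <- rnorm(100)

# No formula: start from the data, then assign roles explicitly
rec <- recipe(wide_df) |>
  update_role(everything(), new_role = "predictor") |>
  update_role(outcome, new_role = "outcome")

# Check how many columns ended up in each role
table(summary(rec)$role)
```

The recipe can then be added to a workflow in place of a formula. As the thread notes, this changes how the variables are specified, not what glm() itself stores in the fitted object.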

jordanRupton commented 2 years ago

Thanks for looking into this so thoroughly.

My data is wide, several hundred columns. I tried switching from using a formula to update_role()... the file size remains quite large, but as you point out that's just an outcome of using glm(). I'll experiment with other engines.

Thanks again!

juliasilge commented 2 years ago

Really wide data can be tough for many kinds of models, including tree-based models. I think I'd look at something like lasso or ridge regularization in that situation, either with glmnet or LiblineaR. LiblineaR is really fast.
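As a rough illustration of the regularization suggestion, here is a sketch that calls glmnet directly rather than through tidymodels, assuming the glmnet package is installed. The simulated data and the penalty value `s = 0.1` are arbitrary choices for the example, not recommendations.

```r
library(glmnet)

# Made-up wide data: more columns than rows
set.seed(1)
n <- 200
p <- 300
x <- matrix(rnorm(n * p), nrow = n)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# alpha = 1 gives the lasso penalty, alpha = 0 gives ridge
lasso <- glmnet(x, y, alpha = 1)
ridge <- glmnet(x, y, alpha = 0)

# Coefficients (intercept plus p predictors) at one penalty value
lasso_coefs <- coef(lasso, s = 0.1)
dim(lasso_coefs)

# Size of the fitted object, for comparison with the input matrix
object.size(lasso)
object.size(x)
```

In practice the penalty would be chosen by resampling, for example with `cv.glmnet()` or by tuning in tidymodels.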

Good luck!
