Saving `lm` model via vetiver object takes a lot of space

lschneiderbauer commented 11 months ago

Hi,

Thank you for putting effort into trying to make life easier for ML people. :) I am just experimenting with the vetiver package to see if we can make use of it, and am stumbling over some issues.

I set up a simple tidymodels workflow, fitted some data (~14 million records), created a vetiver object and tried to persist it with vetiver_pin_write(). The problem I have is that the result takes ~1.4 GB on my hard disk.

Is this intentional? In our use case we really only need the (stored) model to make predictions and provide confidence intervals. For that, storing the fitted coefficients and their associated uncertainties should be enough, and I don't see why that should take 1.4 GB of space.

I tried to experiment with the model = FALSE parameter for lm(), but that only reduced the file size by half or so. It seems to have something to do with the fit$qr$qr object inside the fitted model. I can manually remove that, and the file size gets to an acceptable size, but neither vetiver nor butcher does so automatically.

Do I have to live with the fact that the trained models will take a large amount of space, or are there some measures I can take to get the size down to the order of a couple of KB?

juliasilge commented 11 months ago

This is a great question @lschneiderbauer. It's more about butcher than vetiver, so I will plan to move this issue over there. You can see what specifically we remove from an lm() model here, and notice that we don't remove the qr component. The reason is that component is needed for generating prediction intervals, which is something we typically want to retain for models.

I wonder if we should consider two levels of butchering, one that retains the ability to make all kinds of predictions and one that is less conservative and only retains the ability to make a very simple prediction.
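To make the trade-off concrete, here is a minimal base-R sketch of what is lost when the QR matrix is dropped. It is an illustration, not butcher's implementation: the small `mpg ~ wt + hp` model, the `new_cars` data frame, and the variable names are assumptions chosen for the example.

```r
# Fit a small model on the built-in mtcars data.
full_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Copy the fit and replace the large QR matrix with a placeholder,
# leaving the rest of the `qr` list (rank, pivot, ...) in place.
slim_fit <- full_fit
slim_fit$qr$qr <- matrix(0)

# Hypothetical new observations to predict on.
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(100, 150))

# Point predictions rely on the coefficients and the pivot, not on qr$qr,
# so both fits should agree.
point_full <- predict(full_fit, new_cars)
point_slim <- predict(slim_fit, new_cars)
stopifnot(isTRUE(all.equal(point_full, point_slim)))

# Prediction intervals use the R factor stored in qr$qr. With the
# placeholder in place, this call is expected to fail, so the error
# is captured instead of stopping the script.
interval_full <- predict(full_fit, new_cars, interval = "prediction")
interval_slim <- tryCatch(
  predict(slim_fit, new_cars, interval = "prediction"),
  error = function(e) e
)
print(inherits(interval_slim, "error"))
```

In other words, a "less conservative" level of butchering along these lines would keep simple predictions working while giving up intervals and standard errors.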

In the meantime, if I were you, I would probably use the butcher infrastructure to remove the components you want before creating a vetiver model, something like this:

library(butcher)
library(vetiver)

more_cars <- mtcars[rep(1:32, each = 1e4),]
cars_lm <- lm(mpg ~ ., data = more_cars)
weigh(cars_lm)
#> # A tibble: 25 × 2
#>    object         size
#>    <chr>         <dbl>
#>  1 qr.qr         54.0 
#>  2 residuals     28.4 
#>  3 fitted.values 28.4 
#>  4 effects        5.12
#>  5 model.mpg      2.56
#>  6 model.cyl      2.56
#>  7 model.disp     2.56
#>  8 model.hp       2.56
#>  9 model.drat     2.56
#> 10 model.wt       2.56
#> # ℹ 15 more rows

axe_custom <- function(x) {
    ## you probably don't want residuals either:
    x <- butcher:::exchange(x, "residuals", numeric(0))
    ## swap the large QR matrix for a tiny placeholder
    ## (this gives up prediction intervals):
    x$qr <- butcher:::exchange(x$qr, "qr", matrix(0))
    x
}

axed_lm <- axe_custom(cars_lm)
weigh(axed_lm)
#> # A tibble: 25 × 2
#>    object         size
#>    <chr>         <dbl>
#>  1 fitted.values 28.4 
#>  2 effects        5.12
#>  3 model.mpg      2.56
#>  4 model.cyl      2.56
#>  5 model.disp     2.56
#>  6 model.hp       2.56
#>  7 model.drat     2.56
#>  8 model.wt       2.56
#>  9 model.qsec     2.56
#> 10 model.vs       2.56
#> # ℹ 15 more rows

v <- vetiver_model(axed_lm, "custom-butchered-lm")
weigh(v)
#> # A tibble: 37 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 model.effects     5.12
#>  2 model.model.mpg   2.56
#>  3 model.model.cyl   2.56
#>  4 model.model.disp  2.56
#>  5 model.model.hp    2.56
#>  6 model.model.drat  2.56
#>  7 model.model.wt    2.56
#>  8 model.model.qsec  2.56
#>  9 model.model.vs    2.56
#> 10 model.model.am    2.56
#> # ℹ 27 more rows

Created on 2023-11-30 with reprex v2.0.2
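The output above still shows the `effects` vector and the model frame columns taking up space. As a rough sketch of how one might go further when only point predictions on new data are needed, the helper below strips those as well, using base R only. `strip_lm()` is a hypothetical name, not part of butcher or vetiver. Keeping a zero-row copy of the model frame is an assumption about what downstream tools may want (column types without the data), and whether `vetiver_model()` accepts an object reduced this far is not verified here.

```r
# Hypothetical helper (not part of butcher or vetiver): keep only what
# predict() needs for point predictions on new data.
strip_lm <- function(fit) {
  fit$residuals <- numeric(0)
  fit$fitted.values <- numeric(0)
  fit$effects <- numeric(0)
  fit$qr$qr <- matrix(0)                     # gives up intervals and se.fit
  fit$model <- fit$model[0, , drop = FALSE]  # zero rows, column types kept
  fit
}

# A smaller replica of the thread's example data (each = 100 instead of 1e4).
big_cars <- mtcars[rep(seq_len(nrow(mtcars)), each = 100), ]
full_fit <- lm(mpg ~ ., data = big_cars)
slim_fit <- strip_lm(full_fit)

# In-memory sizes in bytes; the pinned file size on disk may differ.
size_full <- as.numeric(object.size(full_fit))
size_slim <- as.numeric(object.size(slim_fit))
print(c(full = size_full, slim = size_slim))

# Point predictions on new data should be unchanged.
preds_match <- isTRUE(all.equal(
  predict(full_fit, newdata = mtcars),
  predict(slim_fit, newdata = mtcars)
))
print(preds_match)
```

Because the residuals and `qr$qr` are gone, anything that depends on them (`summary()`, `se.fit = TRUE`, confidence or prediction intervals) should be treated as unavailable on the stripped object.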

juliasilge commented 11 months ago

Oops no, I can't transfer an issue from the rstudio org to the tidymodels org. I'll open a new issue over there.

juliasilge commented 11 months ago

Please feel free to add any details over at tidymodels/butcher#272 @lschneiderbauer! 🙌

rstudio / vetiver-r

Saving `lm` model via vetiver object takes a lot of space #264