Closed lschneiderbauer closed 11 months ago
This is a great question @lschneiderbauer. It's more about butcher than vetiver, so I will plan to move this issue over there. You can see what specifically we remove from an lm() model here, and notice that we don't remove the qr component. The reason is that this component is needed for generating prediction intervals, which is something we typically want to retain for models.
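To make the trade-off concrete, here is a minimal base R sketch (not butcher's own code) of what happens when the qr matrix is emptied out of a small lm() fit. The names no_qr and new_cars are made up for illustration, and the behaviour is what I would expect from predict.lm() in current R versions rather than a documented guarantee.

```r
# Fit a small linear model on a built-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))

# Copy of the model with the large QR matrix replaced by a placeholder.
# The rest of the qr list (pivot, rank, ...) is left alone.
no_qr <- fit
no_qr$qr$qr <- matrix(0)

# Point predictions only need the coefficients and the pivot, so they match.
point_ok <- isTRUE(all.equal(predict(no_qr, new_cars), predict(fit, new_cars)))

# Interval predictions go through the R factor stored in qr$qr. Capture
# whatever happens instead of assuming a particular error message.
interval_result <- tryCatch(
  predict(no_qr, new_cars, interval = "prediction"),
  error = function(e) e
)
interval_failed <- inherits(interval_result, "error")

print(point_ok)
print(interval_failed)
```

So a model stripped this way should still be fine for plain predictions, but not for anything that needs standard errors or intervals.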
I wonder if we should consider two levels of butchering, one that retains the ability to make all kinds of predictions and one that is less conservative and only retains the ability to make a very simple prediction.
In the meantime, if I were you, I would probably use the butcher infrastructure to remove the components you want before creating a vetiver model, something like this:
library(butcher)
library(vetiver)
more_cars <- mtcars[rep(1:32, each = 1e4),]
cars_lm <- lm(mpg ~ ., data = more_cars)
weigh(cars_lm)
#> # A tibble: 25 × 2
#> object size
#> <chr> <dbl>
#> 1 qr.qr 54.0
#> 2 residuals 28.4
#> 3 fitted.values 28.4
#> 4 effects 5.12
#> 5 model.mpg 2.56
#> 6 model.cyl 2.56
#> 7 model.disp 2.56
#> 8 model.hp 2.56
#> 9 model.drat 2.56
#> 10 model.wt 2.56
#> # ℹ 15 more rows
axe_custom <- function(x) {
  ## you probably don't want residuals either:
  x <- butcher:::exchange(x, "residuals", numeric(0))
  ## swap the large QR matrix for a tiny placeholder:
  x$qr <- butcher:::exchange(x$qr, "qr", matrix(0))
  x
}
axed_lm <- axe_custom(cars_lm)
weigh(axed_lm)
#> # A tibble: 25 × 2
#> object size
#> <chr> <dbl>
#> 1 fitted.values 28.4
#> 2 effects 5.12
#> 3 model.mpg 2.56
#> 4 model.cyl 2.56
#> 5 model.disp 2.56
#> 6 model.hp 2.56
#> 7 model.drat 2.56
#> 8 model.wt 2.56
#> 9 model.qsec 2.56
#> 10 model.vs 2.56
#> # ℹ 15 more rows
v <- vetiver_model(axed_lm, "custom-butchered-lm")
weigh(v)
#> # A tibble: 37 × 2
#> object size
#> <chr> <dbl>
#> 1 model.effects 5.12
#> 2 model.model.mpg 2.56
#> 3 model.model.cyl 2.56
#> 4 model.model.disp 2.56
#> 5 model.model.hp 2.56
#> 6 model.model.drat 2.56
#> 7 model.model.wt 2.56
#> 8 model.model.qsec 2.56
#> 9 model.model.vs 2.56
#> 10 model.model.am 2.56
#> # ℹ 27 more rows
Created on 2023-11-30 with reprex v2.0.2
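Since the original concern was size on disk, it may also be worth checking the serialized size, because as far as I can tell weigh() reports in-memory sizes. The sketch below uses only base R and mimics axe_custom() with plain assignments instead of butcher:::exchange(), so it is an approximation of the reprex above and not the same code path. It also uses a smaller replication factor to keep it quick.

```r
# Smaller version of the data from the reprex (1e3 instead of 1e4 copies).
more_cars <- mtcars[rep(1:32, each = 1e3), ]
cars_lm <- lm(mpg ~ ., data = more_cars)

# Base R stand-in for axe_custom(): drop residuals and the QR matrix.
slim_lm <- cars_lm
slim_lm$residuals <- numeric(0)
slim_lm$qr$qr <- matrix(0)

# Number of bytes each object takes when serialized, which is closer to
# what ends up on disk than an in-memory size.
size_full <- length(serialize(cars_lm, NULL))
size_slim <- length(serialize(slim_lm, NULL))

print(c(full = size_full, slim = size_slim))
```

The slimmed object still carries fitted.values, effects, and the model frame, so this is a lower bar than what vetiver_model() ends up storing.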
Oops no, I can't transfer an issue from the rstudio org to the tidymodels org. I'll open a new issue over there.
Please feel free to add any details over at tidymodels/butcher#272 @lschneiderbauer! 🙌
Hi,
Thank you for putting effort into trying to make life easier for ML people. :) I am just experimenting with the vetiver package to see if we can make use of it, and am stumbling over some issues.
I set up a simple tidymodels workflow, fitted some data (~14 million records), created a vetiver object, and tried to persist it with vetiver_pin_write(). The problem is that the result takes ~1.4 GB on my hard disk.
Is this intentional? In our use case we really only need the (stored) model to make predictions and provide confidence intervals. For that, storing the trained coefficients and their associated uncertainties should be enough, and I don't see why that should take 1.4 GB of space.
I tried to experiment with the model = FALSE parameter for lm(), but that only reduced the file size by half or so. It seems to have something to do with the fit$qr$qr object inside the fitted model. I can manually remove that, and the file size gets down to an acceptable size, but neither vetiver nor butcher does so automatically.
Do I have to live with the fact that the trained models will take a large amount of space, or are there some measures I can take to get it down to the order of a couple of KB?