Closed: dpprdan closed this issue 2 years ago.
Thanks for the issue!
Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well?
You'll indeed need to keep an eye out for lack of native serialization support for lightgbm in bonsai.
We're actively working on better infrastructure for supporting native serialization methods. That experimental work currently lives at rstudio/bundle [edit: changed URL] if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects output by tidymodels / vetiver soon. I'd anticipate this work reaching our CRAN packages before the end of the year.
Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?
I'm not sure I'd put forth a "recommended approach" for now. The somewhat hacky approach of
saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read
works, but is quite painful. At the same time, our approach with bundle
bonsai_fit_bundled <- bundle(bonsai_fit)
saveRDS(bonsai_fit_bundled, path1)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_new <- unbundle(bonsai_fit_bundled)
works but is still experimental/unstable, and may just happen under the hood here soon. Whichever feels better for you is fine for now, though we hope to confidently recommend the latter soon. :)
Related to https://github.com/tidymodels/butcher/issues/147, https://github.com/tidymodels/parsnip/issues/779, https://github.com/tidymodels/stacks/issues/145.
Again, thanks for bringing this up.
Thank you very much @simonpcouch! This looks great already. For now I think I'll go with the first "painful" approach, since that looks like it could still work even if {bundle} is in effect under the hood.
I tried to create a reprex with the two approaches, and manually saving the lgbm object (the "painful" approach) does not work for me, i.e. it throws an error on predict() that the workflow is not yet fit().
library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#>
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#>
#> slice
library(bonsai)
data(ames)
## build model
ames <-
ames |>
select(
Sale_Price,
Neighborhood,
Gr_Liv_Area,
Year_Built,
Bldg_Type,
Latitude,
Longitude
) |>
mutate(Sale_Price = log10(Sale_Price))
spec <-
boost_tree() |>
set_engine("lightgbm") |>
set_mode("regression")
rec <-
recipe(Sale_Price ~ ., data = ames) |>
step_dummy(all_nominal_predictors())
wf <-
workflow() |>
add_model(spec) |>
add_recipe(rec)
ft <- fit(wf, ames)
## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 Γ 1
#> .pred
#> <dbl>
#> 1 5.24
## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit <- readRDS.lgb.Booster("ft_engine.rds")
predict(ft_read, ames[1,])
#> Error in `extract_fit_parsnip()`:
#> ! Can't extract a model fit from an untrained workflow.
#> βΉ Do you need to call `fit()`?
## using {bundle} works
library(bundle)
ft |> bundle() |> saveRDS("ft_bndl.rds")
ft_bndl_read <- readRDS("ft_bndl.rds") |> unbundle()
predict(ft_bndl_read, ames[1,])
#> # A tibble: 1 Γ 1
#> .pred
#> <dbl>
#> 1 5.24
I am a bit worried about bundle still being experimental, so ideally, I'd like the more verbose but stable way to work as well.
Sure thing! Thanks for the reprex.
Since you're fitting with a workflow rather than a plain parsnip model spec, that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit. With your reprex:
library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#>
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#>
#> slice
library(bonsai)
data(ames)
## build model
ames <-
ames |>
select(
Sale_Price,
Neighborhood,
Gr_Liv_Area,
Year_Built,
Bldg_Type,
Latitude,
Longitude
) |>
mutate(Sale_Price = log10(Sale_Price))
spec <-
boost_tree() |>
set_engine("lightgbm") |>
set_mode("regression")
rec <-
recipe(Sale_Price ~ ., data = ames) |>
step_dummy(all_nominal_predictors())
wf <-
workflow() |>
add_model(spec) |>
add_recipe(rec)
ft <- fit(wf, ames)
## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 Γ 1
#> .pred
#> <dbl>
#> 1 5.24
## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit$fit$fit <- readRDS.lgb.Booster("ft_engine.rds")
predict(ft_read, ames[1,])
#> # A tibble: 1 Γ 1
#> .pred
#> <dbl>
#> 1 5.24
Created on 2022-08-08 by the reprex package (v2.0.1)
Just want to add to this conversation that since December 2021, {lightgbm}'s development version has supported using readRDS() / saveRDS() directly for {lightgbm} models: https://github.com/microsoft/LightGBM/pull/4685
Sorry that that hasn't made it into a CRAN release yet. You can subscribe to https://github.com/microsoft/LightGBM/issues/5153 to be notified when that happens.
Just mentioning it because if using a development version of {lightgbm} built from source is an option (which I do understand is kind of painful), it might remove the need for other workarounds.
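If that development build is an option, the whole fitted workflow should round-trip with plain saveRDS()/readRDS(), with no separate step for the booster. A minimal sketch, assuming a {lightgbm} development version with the native serialization support from that PR, and the fitted workflow `ft` from the reprex above:

```r
## With lightgbm's development version, the lgb.Booster inside the
## fitted workflow serializes natively, so the usual round trip suffices:
saveRDS(ft, "ft.rds")
ft_read <- readRDS("ft.rds")
predict(ft_read, ames[1, ])
```

With a CRAN {lightgbm} release that lacks this support, the read-back booster's handle would be invalid and predict() would fail, which is what the workarounds above are for.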
Thanks for the note here, @jameslamb! Hadn't noticed that PR. Will consider that when figuring out our approach here / in bundle.
> that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit

Haha, I tried $fit$fit before but not $fit$fit$fit.
I think I might just go with the dev/4.0 version of {lightgbm}.
An update from the bundle side:
We've opted to remove the lightgbm bundle method in light of that upcoming feature in lightgbm. This should "just work" in good time. :)
I've had some issues with saving and {butcher}ing (reducing the file size of saved models) xgboost models via tidymodels some months ago. What this came down to (IIUC) is that tidymodels does not support native serialization of those models at the moment.