tidymodels / bonsai

parsnip wrappers for tree-based models
https://bonsai.tidymodels.org

Things to keep in mind when saving lightgbm models? #44

Closed dpprdan closed 2 years ago

dpprdan commented 2 years ago

I've had some issues with saving and {butcher}ing (reducing file size of saved model) xgboost models via tidymodels some months ago. What this came down to (IIUC) is that tidymodels does not support native serialization of those models at the moment.

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well? Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

simonpcouch commented 2 years ago

Thanks for the issue!

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well?

You'll indeed need to keep an eye out for lack of native serialization support for lightgbm in bonsai.

We're actively working on better infrastructure for supporting native serialization methods. That experimental work currently lives at rstudio/bundle [edit: changed URL] if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects outputted by tidymodels / vetiver soon. I'd anticipate this work to reach our CRAN packages before the end of the year. 👍

Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

I'm not sure I'd put forth a "recommended approach" for now; the somewhat hacky approach of

saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read

works, but is quite painful. At the same time, our approach with bundle

bonsai_fit_bundled <- bundle(bonsai_fit)
saveRDS(bonsai_fit_bundled, path1)

bonsai_fit_read <- readRDS(path1)
bonsai_fit_new <- unbundle(bonsai_fit_read)

works but is still experimental/unstable, and may just happen under the hood here soon. Whichever feels better for you is fine for now, though we hope to confidently recommend the latter soon. :)

Related to https://github.com/tidymodels/butcher/issues/147, https://github.com/tidymodels/parsnip/issues/779, https://github.com/tidymodels/stacks/issues/145.

Again, thanks for bringing this up. 🏄‍♀️

dpprdan commented 2 years ago

Thank you very much @simonpcouch! This looks great already. For now I think I'll go with the first "painful" approach, since that looks like it could still work even if {bundle} is in effect under the hood. 🤔

dpprdan commented 2 years ago

I tried to create a reprex with the two approaches, and manually saving the lgbm object (the "painful" approach) does not work for me, i.e. it throws an error on predict() that the workflow is not yet fit().

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> Error in `extract_fit_parsnip()`:
#> ! Can't extract a model fit from an untrained workflow.
#> ℹ Do you need to call `fit()`?

## using {bundle} works
library(bundle)
ft |> bundle() |> saveRDS("ft_bndl.rds")
ft_bndl_read <- readRDS("ft_bndl.rds") |> unbundle()
predict(ft_bndl_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24
Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-08-08
#>  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  bonsai       * 0.1.0      2022-06-23 [1] CRAN (R 4.2.1)
#>  broom        * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  bundle       * 0.0.0.9200 2022-08-08 [1] Github (simonpcouch/bundle@77d630c)
#>  class          7.3-20     2022-01-16 [2] CRAN (R 4.2.1)
#>  cli            3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  data.table     1.14.2     2021-09-27 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  dials        * 1.0.0      2022-06-14 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  furrr          0.3.0      2022-05-04 [1] CRAN (R 4.2.0)
#>  future         1.27.0     2022-07-22 [1] CRAN (R 4.2.1)
#>  future.apply   1.9.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6      2022-05-03 [1] CRAN (R 4.2.0)
#>  globals        0.15.1     2022-06-24 [1] CRAN (R 4.2.1)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.2.0)
#>  hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.2.1)
#>  highr          0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
#>  infer        * 1.0.2      2022-06-10 [1] CRAN (R 4.2.0)
#>  ipred          0.9-13     2022-06-02 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  jsonlite       1.8.0      2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr          1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.2.1)
#>  lava           1.6.10     2021-09-02 [1] CRAN (R 4.2.0)
#>  lhs            1.1.5      2022-03-22 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  lightgbm     * 3.3.2      2022-01-14 [1] CRAN (R 4.2.1)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.2.0)
#>  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-58.1   2022-08-03 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1      2022-03-23 [2] CRAN (R 4.2.1)
#>  modeldata    * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-17     2022-01-16 [2] CRAN (R 4.2.1)
#>  parallelly     1.32.1     2022-07-21 [1] CRAN (R 4.2.1)
#>  parsnip      * 1.0.0      2022-06-16 [1] CRAN (R 4.2.0)
#>  pillar         1.8.0      2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.2.0)
#>  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0     2022-06-28 [1] CRAN (R 4.2.1)
#>  R6           * 2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
#>  recipes      * 1.0.1      2022-07-07 [1] CRAN (R 4.2.1)
#>  reprex         2.0.1      2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.4      2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.14       2022-04-25 [1] CRAN (R 4.2.0)
#>  rpart          4.1.16     2022-01-24 [2] CRAN (R 4.2.1)
#>  rsample      * 1.0.0      2022-06-24 [1] CRAN (R 4.2.1)
#>  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.0      2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  styler         1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  survival       3.3-1      2022-03-03 [2] CRAN (R 4.2.1)
#>  tibble       * 3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.1)
#>  tidyr        * 1.2.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  timeDate       4021.104   2022-07-19 [1] CRAN (R 4.2.1)
#>  tune         * 1.0.0      2022-07-07 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.0.0      2022-07-05 [1] CRAN (R 4.2.1)
#>  workflowsets * 1.0.0      2022-07-12 [1] CRAN (R 4.2.1)
#>  xfun           0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.0.0      2022-06-06 [1] CRAN (R 4.2.0)
#>
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

I am a bit worried about bundle still being experimental, so ideally, I'd like the more verbose but stable way to work as well.

simonpcouch commented 2 years ago

Sure thing! Thanks for the reprex.

Since you're fitting with a workflow rather than a plain parsnip model spec, that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit. With your reprex:

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb separately works when the booster is restored into $fit$fit$fit
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit$fit$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

Created on 2022-08-08 by the reprex package (v2.0.1)
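
If the save/restore steps above get repeated a lot, they could be wrapped in two small helpers. The following is only a rough sketch: `save_workflow_parts()` and `load_workflow_parts()` are hypothetical names, not part of bonsai, workflows, or lightgbm, and the `save_engine` / `read_engine` arguments are an assumption made so the sketch runs with base R alone. It also assumes the engine fit sits in `$fit$fit$fit`, which is the case for the fitted workflow in the reprex but not for a plain parsnip fit.

```r
# Hypothetical helpers; not part of bonsai, workflows, or lightgbm.
# `save_engine` / `read_engine` default to base R so the sketch runs
# without lightgbm installed.
save_workflow_parts <- function(wf_fit, wf_path, engine_path,
                                save_engine = saveRDS) {
  # Save the whole fitted workflow object.
  saveRDS(wf_fit, wf_path)
  # Save the engine fit separately, taken from the nested $fit$fit$fit slot.
  save_engine(wf_fit$fit$fit$fit, engine_path)
  invisible(c(workflow = wf_path, engine = engine_path))
}

load_workflow_parts <- function(wf_path, engine_path,
                                read_engine = readRDS) {
  wf_fit <- readRDS(wf_path)
  # Restore the engine into the same nested slot it was taken from.
  wf_fit$fit$fit$fit <- read_engine(engine_path)
  wf_fit
}
```

With lightgbm 3.3.2, as in the reprex, one would presumably pass `save_engine = lightgbm::saveRDS.lgb.Booster` and `read_engine = lightgbm::readRDS.lgb.Booster`, since both take the object or file as their first arguments.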

jameslamb commented 2 years ago

Just want to add to this conversation that since December 2021, {lightgbm}'s development version has supported using readRDS() / saveRDS() directly for {lightgbm} models: https://github.com/microsoft/LightGBM/pull/4685

Sorry that that hasn't made it into a CRAN release yet. You can subscribe to https://github.com/microsoft/LightGBM/issues/5153 to be notified when that happens.

Just mentioning it because if using a development version of {lightgbm} built from source is an option (which I do understand is kind of painful), it might remove the need for other workarounds.
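
For code that has to run against both older and newer releases, one option might be to branch on the installed version. This is a minimal sketch under a loud assumption: that plain saveRDS() / readRDS() support first ships in the 4.0.0 CRAN release (the thread only says "development version", and later "dev/4.0"). `needs_booster_workaround()` is a hypothetical helper, not a lightgbm function, and the cutoff is a parameter so it can be adjusted. Development builds may carry a pre-4.0 version number while already having the feature, so the check is only meaningful for CRAN releases.

```r
# Hypothetical helper; not part of lightgbm.
# Assumption: 4.0.0 is the first release with plain saveRDS()/readRDS()
# support for boosters; change `cutoff` if that turns out to be wrong.
needs_booster_workaround <- function(version, cutoff = "4.0.0") {
  package_version(as.character(version)) < package_version(cutoff)
}

needs_booster_workaround("3.3.2")  # TRUE: use saveRDS.lgb.Booster() etc.
needs_booster_workaround("4.0.0")  # FALSE: plain saveRDS() should do
```

In a real session the argument would be `utils::packageVersion("lightgbm")`.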

simonpcouch commented 2 years ago

Thanks for the note here, @jameslamb! Hadn't noticed that PR. Will consider that when figuring out our approach here / in bundle.

dpprdan commented 2 years ago

that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit

haha, I tried $fit$fit before but not $fit$fit$fit. 😂

I think I might just go with the dev/4.0 version of {lightgbm}. 😉

simonpcouch commented 2 years ago

An update from the bundle side:

We've opted to remove the lightgbm bundle method in light of that upcoming feature in lightgbm. This should "just work" in good time. :)

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.