rstudio / bundle

Prepare objects for serialization with a consistent interface
https://rstudio.github.io/bundle/

Clarify what you can expect to do after bundling, i.e. `predict` #50

Closed ClaudiuPapasteri closed 1 month ago

ClaudiuPapasteri commented 1 year ago

I am not sure if this is a known issue, as it doesn't appear in the docs. It seems that, except for `predict()`, other methods like `tidy()` or `rank_results()` fail on the unbundled object. This SO post references the same problem.

library(tidymodels)
library(agua)
h2o_start()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
  auto_ml() %>%
  set_engine("h2o", max_runtime_secs = 120, seed = 1) %>%
  set_mode("regression")

normalized_rec <-
  recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

auto_wflow <-
  workflow() %>%
  add_model(auto_spec) %>%
  add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

# Save
auto_fit_bundle <- bundle(auto_fit)
saveRDS(auto_fit_bundle, file = "test.h2o.auto_fit.rds")  # save the object

# Load
auto_fit_bundle <- readRDS("test.h2o.auto_fit.rds")
auto_fit <- unbundle(auto_fit_bundle)

rank_results(auto_fit)
tidy(auto_fit)

Error in UseMethod("rank_results") : no applicable method for 'rank_results' applied to an object of class "c('H2ORegressionModel', 'H2OModel', 'Keyed')"

juliasilge commented 1 year ago

That's true, yep! The focus of bundle is to capture the references needed by a model to make predictions in a new environment. For more info, you can look at the bundle documentation.

I would generally expect functions like tidy() and rank_results() to be called during model development, and not so much during model deployment. Can you share a bit more about your use case?
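For concreteness, a minimal sketch of the workflow that bundle does support, reusing the object names from the reprex above (the saved `test.h2o.auto_fit.rds` file and `concrete_test` are assumed to exist from the earlier session):

```r
# In the new/deployment R session: bundle guarantees prediction only.
library(bundle)
library(agua)
h2o_start()  # the h2o backend must be running before unbundling an h2o model

auto_fit_bundle <- readRDS("test.h2o.auto_fit.rds")
auto_fit <- unbundle(auto_fit_bundle)

# Supported: predicting on new data
predict(auto_fit, new_data = concrete_test)

# Not guaranteed: development-time helpers such as tidy() or rank_results()
```

Anything beyond `predict()` is restored as a side effect only, if at all.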

ClaudiuPapasteri commented 1 year ago

Thank you for the helpful reply; I suspected this was the case, and the links you shared made it much clearer. Unfortunately, although the scope of the bundle package should be clear to everyone, the possible affordances of the post-bundle object (other than prediction) are not so obvious (to me, at least). Maybe it would be helpful to state this more clearly in the documentation. Anyway, thank you all for the awesome package ecosystem, and thank you Julia; your work and talks inspired and helped me throughout my data journey. It's an honor ...

juliasilge commented 1 year ago

Thank you so much for the kind words! ❤️

Let's keep this issue open and clarify some of the documentation about what you can expect to do after bundling, especially in the README and main vignette.

(As a side note, I also maintain butcher, and this works about the same way there. Sometimes we keep components in butcher that are needed for something like `predict(interval = "prediction")` but not for typical predictions.)

Steviey commented 5 months ago

Can we use pkg: bundle to:

referring to:

https://rstudio.github.io/bundle/

https://rstudio.github.io/bundle/articles/bundle.html

https://rstudio.github.io/bundle/reference/bundle_h2o.html

juliasilge commented 5 months ago

@Steviey The normal usage that we expect after bundling is to predict with your model, but if you can get out the parsnip object, you should be able to refit:

library(bundle)
library(parsnip)
library(callr)

## bundle a model
mod <-
    boost_tree(trees = 5, mtry = 3) %>%
    set_mode("regression") %>%
    set_engine("xgboost") %>%
    fit(mpg ~ ., data = mtcars[1:25,])

bundled_mod <- bundle(mod)

## fit the model to new data
r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)

    unbundled_mod <- unbundle(bundled_mod)
    fittable_model <- extract_spec_parsnip(unbundled_mod)
    fittable_model |> fit(mpg ~ ., data = mtcars[26:32,])
  },
  args = list(
    bundled_mod = bundled_mod
  )
)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> Handle is invalid! Suggest using xgb.Booster.complete
#> raw: 7.7 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 0.3, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 5, watchlist = x$watchlist, 
#>     verbose = 0, nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.3", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 10 
#> niter: 5
#> nfeatures : 10 
#> evaluation_log:
#>   iter training_rmse
#>  <num>         <num>
#>      1     16.923941
#>      2     12.953166
#>      3     10.022720
#>      4      7.801856
#>      5      6.089100

Created on 2024-03-29 with reprex v2.1.0

Steviey commented 5 months ago

@juliasilge Thank you Julia. `extract_spec_parsnip()` returns a parsnip model specification. Does this include hyperparameters from earlier trainings and fits before bundling? Would this include sub-models from the leaderboard of an h2o AutoML model?

juliasilge commented 5 months ago

@Steviey Hmmmm, I am not entirely sure as I don't have a ton of experience with H2O. I think a good venue for this kind of question is the agua repo: https://github.com/tidymodels/agua

Steviey commented 5 months ago

@juliasilge Thank you Julia for the response. Since the h2o issue goes deeper into h2o itself, mentioned for example here: https://github.com/business-science/modeltime.h2o/issues/14, I would guess this is still not really resolved after some years. So my hope was that pkg. bundle would do the job entirely.

More generally, related to tidymodels (models other than h2o): I'm mainly interested in refitting on new data, but with previously searched hyperparameters. Let's say I train a model and search for hyperparameters on one day, bundle and save the model or workflow for later use, and then the next day unbundle and refit on new/more data. Can we then reuse the effort/compute time from the day before, namely the best hyperparameters found before bundling? Are they included in the bundle for later use? Or do we have to save and retrieve that separately?

This could be an ecological question too (green ML/AI).

Maybe related: https://github.com/tidymodels/tune/issues/84

If bundle requires separate actions in this regard, I'm not sure if this is still best practice:

exec(update, object = tree_mod, !!!final_param)

juliasilge commented 5 months ago

@Steviey The bundle package can handle bundling up the needed references but doesn't have functionality for getting the best hyperparameters; you'd need to get those through tidymodels infrastructure, in either tune or agua. Once you have those hyperparameters, bundle will definitely work. 👍
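A hedged sketch of that division of labor, using standard tune functions (`tune_grid()`, `select_best()`, `finalize_workflow()`); the objects `wflow`, `folds`, `train_day1`, and `train_day2` are illustrative placeholders, not from this thread:

```r
library(tidymodels)
library(bundle)

# Day 1: search hyperparameters with tune, then save them alongside the bundle.
tuned <- tune_grid(wflow, resamples = folds, grid = 20)
best_params <- select_best(tuned, metric = "rmse")

fitted <- finalize_workflow(wflow, best_params) |>
  fit(data = train_day1)

saveRDS(bundle(fitted), "model.rds")     # for prediction in a new session
saveRDS(best_params, "best_params.rds")  # hyperparameters, saved separately

# Day 2: refit on new data, reusing yesterday's hyperparameters.
best_params <- readRDS("best_params.rds")
refit <- finalize_workflow(wflow, best_params) |>
  fit(data = train_day2)
```

Saving `best_params` separately is an explicit alternative to pulling the finalized spec back out of the unbundled workflow with `extract_spec_parsnip()`, as shown earlier in the thread.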

Steviey commented 5 months ago

@juliasilge OK, then I would bet on finalize more than on update.

simonpcouch commented 1 month ago

Feels worth mentioning that the Value documentation for each bundle method states:

The output of unbundle() is a model object that is ready to predict() on new data, and other restored functionality (like plotting or summarizing) is supported as a side effect only.

I would argue that this is sufficient to set expectations for what users can do with unbundled objects. :)

juliasilge commented 1 month ago

That's a great point @simonpcouch. 👍

We haven't heard a lot of other confusion on this point to date, so let's close this as complete. We can revisit in the future as necessary!