Closed by ClaudiuPapasteri 1 month ago
That's true, yep! The focus of bundle is to capture the references needed by a model to make predictions in a new environment. For more info, you can look at:
I would generally expect functions like tidy() and rank_results() to be called during model development, and not so much during model deployment. Can you share a bit more about your use case?
Thank you for the helpful reply; I suspected this was the case, and the links you shared made it much clearer. Unfortunately, although the scope of the bundle package should be clear to everyone, the possible uses of the post-bundle object (other than predicting from it) are not so obvious (to me, at least). Maybe it would be helpful to state this more clearly in the documentation.
Anyway, thank you all for the awesome package ecosystem, and thank you Julia; your work and talks have inspired and helped me throughout my data journey. It's an honor ...
Thank you so much for the kind words! ❤️
Let's keep this issue open and clarify some of the documentation about what you can expect to do after bundling, especially in the README and main vignette.
(As a side note, I also maintain butcher, and this is about the same as how butcher works. Sometimes we keep components in butcher that are needed for something like predict(interval = "prediction"), not just your typical predictions.)
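As an illustrative sketch of that side note (my own, not from the thread): butcher trims components a fitted model no longer needs, and the trimmed object can then be bundled; predict() still works on the restored object, while development-time helpers are not guaranteed to.

```r
library(parsnip)
library(butcher)
library(bundle)

# fit a small linear model
mod <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ disp + wt, data = mtcars)

# trim components not needed for prediction, then bundle for transport
trimmed <- butcher(mod)
bundled <- bundle(trimmed)

# later / in another session: unbundle and predict
predict(unbundle(bundled), new_data = mtcars[1:3, ])
```

The exact components butcher removes depend on the model class; for some engines, axing parts of the object is precisely what breaks helpers like summary() or tidy() afterward.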
Can we use pkg: bundle to:
referring to:
@Steviey The normal usage that we expect after bundling is to predict with your model, but if you can get out the parsnip object, you should be able to refit:
library(bundle)
library(parsnip)
library(callr)

## bundle a model
mod <-
  boost_tree(trees = 5, mtry = 3) %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(mpg ~ ., data = mtcars[1:25, ])

bundled_mod <- bundle(mod)

## fit the model to new data
r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)
    unbundled_mod <- unbundle(bundled_mod)
    fittable_model <- extract_spec_parsnip(unbundled_mod)
    fittable_model |> fit(mpg ~ ., data = mtcars[26:32, ])
  },
  args = list(
    bundled_mod = bundled_mod
  )
)
#> parsnip model object
#>
#> ##### xgb.Booster
#> Handle is invalid! Suggest using xgb.Booster.complete
#> raw: 7.7 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 0.3, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 5, watchlist = x$watchlist,
#> verbose = 0, nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.3", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> callbacks:
#> cb.evaluation.log()
#> # of features: 10
#> niter: 5
#> nfeatures : 10
#> evaluation_log:
#> iter training_rmse
#> <num> <num>
#> 1 16.923941
#> 2 12.953166
#> 3 10.022720
#> 4 7.801856
#> 5 6.089100
Created on 2024-03-29 with reprex v2.1.0
@juliasilge Thank you Julia. extract_spec_parsnip() returns a parsnip model specification. Does this include hyperparameters from earlier trainings and fits before bundling? Would this include submodels from the leaderboard of an H2O AutoML model?
@Steviey Hmmmm, I am not entirely sure as I don't have a ton of experience with H2O. I think a good venue for this kind of question is the agua repo: https://github.com/tidymodels/agua
@juliasilge Thank you Julia for the response. Since the H2O issue goes deeper, into H2O itself (mentioned, for example, here: https://github.com/business-science/modeltime.h2o/issues/14), I would guess this is still not really resolved after some years. So my hope was that the bundle package would do the job entirely.
More generally, related to tidymodels (models other than H2O): I'm mainly interested in refitting on new data, but with previously searched hyperparameters. Let's say I train a model and tune hyperparameters on one day, bundle and save the model or workflow for later use, and then the next day unbundle and refit on new/more data. Can we then reuse the effort/compute time from the day before, namely the best hyperparameters found before bundling? Are they included in the bundle for later use? Or do we have to save and retrieve them separately?
This could be an ecological question too (green ML/AI).
Maybe related: https://github.com/tidymodels/tune/issues/84
If bundle requires separate actions in this regard, I'm not sure whether this is still best practice:
exec(update, object = tree_mod, !!!final_param)
@Steviey The bundle package can handle bundling up the needed references but doesn't have functionality for getting the best hyperparameters; you'd need to get that through tidymodels infrastructure in either tune or agua. Once you have those hyperparameters, then definitely bundle will work. 👍
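To make that concrete, here is a minimal sketch (my own, assuming an earlier tuning result `tune_res` and a tunable workflow `wf`, neither of which appears in the thread) of pulling the best hyperparameters with tune, finalizing and fitting the workflow, and only then bundling:

```r
library(tune)
library(workflows)
library(bundle)

# select the best hyperparameters from an earlier tuning run
best_params <- select_best(tune_res, metric = "rmse")

# plug them into the workflow, fit on today's data, and bundle the fit
final_wf  <- finalize_workflow(wf, best_params)
fitted_wf <- fit(final_wf, data = day1_data)
bundled   <- bundle(fitted_wf)

# next day: the finalized (tuned) spec travels inside the bundle,
# so extracting it and refitting reuses the same hyperparameters
spec <- extract_spec_parsnip(unbundle(bundled))
```

Note that `tune_res` itself is not part of the bundle; if you want the full tuning history (not just the winning values), save it separately.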
@juliasilge OK, then I would bet on finalize() rather than on update().
Feels worth mentioning that the Value documentation for each bundle method states:
The output of unbundle() is a model object that is ready to predict() on new data, and other restored functionality (like plotting or summarizing) is supported as a side effect only.
I would argue that this is sufficient to set expectations for what users can do with unbundled objects. :)
That's a great point @simonpcouch. 👍
We haven't heard a lot of other confusion on this point to date, so let's close this as complete. We can revisit in the future as necessary!
I am not sure if this is a known issue, as it doesn't appear in the docs. It seems that, except for predict(), other methods like tidy() or rank_results() fail when using the unbundled object. This SO post references the same problem.
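A hedged sketch of the reported pattern, reusing the xgboost setup from the reprex earlier in the thread (my own reconstruction; whether a given helper errors depends on the engine and on whether you are in a fresh R session):

```r
library(bundle)
library(parsnip)

# fit an xgboost model, then round-trip it through bundle
mod <- boost_tree(trees = 5) %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(mpg ~ ., data = mtcars)

restored <- unbundle(bundle(mod))

# the documented contract: predict() works after unbundling
predict(restored, new_data = mtcars[1:3, ])

# development-time helpers are supported "as a side effect only"
# and may error, e.g. for engines that keep state in external pointers:
# tidy(restored)
```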