Feature suggestion: Extract splits from tune results as a resampling object #947

Open jrosell opened 6 days ago

jrosell commented 6 days ago

Feature suggestion

Now that we have the new {tailor} package for post-processing in tidymodels, I find myself needing to reuse the splits from tune_results as a resampling object.

I believe this new extract_resamples function (or whatever name you prefer) could improve the interactive usage of tidymodels.

Here is a minimal reproducible example to demonstrate its use:

# pak::pak(
#   paste0(
#     "tidymodels/",
#     c("tune", "workflows", "rsample", "tailor")
#   )
# )
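library(tidymodels)
library(probably)
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#>     as.factor, as.ordered
library(tailor)
library(stacks)  # for control_stack_resamples()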

# How well are our predictions calibrated?  Not so well
delivery_split <- initial_split(deliveries)
delivery_train <- training(delivery_split)
delivery_test  <- testing(delivery_split)
delivery_folds <- vfold_cv(delivery_train)
delivery_res <-
  workflow() %>%
  add_formula(time_to_delivery ~ .) %>%
  add_model(boost_tree(mode = "regression", trees = 3)) |> 
  fit_resamples(
    delivery_folds,
    control = control_stack_resamples()
  )
delivery_res |> 
  collect_predictions() |> 
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

delivery_res |> collect_metrics()
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   9.52     10 0.0533  Preprocessor1_Model1
#> 2 rsq     standard   0.853    10 0.00357 Preprocessor1_Model1

# We want to reuse the already saved splits in the tune results as rset
extract_resamples <- \(x) {
  stopifnot(inherits(x, "tune_results"))
  result_rset <- manual_rset(x$splits, x$id)
  new_attrs <- attributes(result_rset)[c("names", "row.names")]
  existing_attrs <- attributes(x)$rset_info$att
  att <- modifyList(existing_attrs, new_attrs)
  desired_classes <- c(att$class, "rset", "tbl_df", "tbl", "data.frame")  
  att$class <- NULL  
  attributes(result_rset) <- att  
  class(result_rset) <- desired_classes
  result_rset
}
waldo::compare(delivery_folds, extract_resamples(delivery_res))
#> ✔ No differences

# Let's adjust numeric calibration extracting the saved splits
delivery_res_improved <-
  delivery_res |> 
  extract_workflow() |> 
  add_tailor(tailor() %>% adjust_numeric_calibration()) |> 
  fit_resamples(
    extract_resamples(delivery_res),
    control = control_stack_resamples()
  )
delivery_res_improved |> collect_metrics()
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   2.71     10 0.0300  Preprocessor1_Model1
#> 2 rsq     standard   0.846    10 0.00432 Preprocessor1_Model1

# Much better
delivery_res_improved |> 
  collect_predictions() |>
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

This implementation seems to give identical results for my vfold_cv example, but I guess other types of rset objects should be tested too.
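
As a starting point for that testing, here is a small sketch that uses only rsample, with no tuning involved. It assumes `mtcars` as stand-in data and only inspects what `manual_rset()` returns and which id columns other rset types carry. It says nothing about how tune stores its results.

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# manual_rset() keeps the splits and ids but not the original subclass,
# which is why the helper above restores the class and attributes afterwards.
rebuilt <- manual_rset(folds$splits, folds$id)
class(folds)
class(rebuilt)

# Repeated CV carries two id columns (id and id2), so a helper that only
# reads `id` would likely need extra handling for this case.
rep_folds <- vfold_cv(mtcars, v = 5, repeats = 2)
names(rep_folds)

# Bootstraps have a single id column, like plain v-fold CV.
boots <- bootstraps(mtcars, times = 5)
names(boots)
```

Repeated CV looks like the first case worth checking, since its second id column is not covered by `x$id` alone.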

simonpcouch commented 4 days ago

Could you say a little bit more about why it is that you'd need to extract the splits from the tune_results rather than just reusing the splits you have already?

Note to self: FWIW, we did find a use for a similar helper in stacks:::.set_splits().

jrosell commented 4 days ago

Well, in my pipelines I usually have one process for fitting resamples and tuning, and sometimes I only save the tune_results object and not the rset. But then, oops, I need the rset too because I want to check something and I didn't save it. {tailor} could make this situation more likely.

Furthermore, I want to try the AutoGluon inference approach, and this function could help.