rstudio / bundle

Prepare objects for serialization with a consistent interface
https://rstudio.github.io/bundle/
Other
27 stars 4 forks source link

New problem with bundling `step_embed()` #54

Closed juliasilge closed 1 year ago

juliasilge commented 1 year ago

While working on #53 I found a new problem in bundling step_embed():

Sys.setenv(RETICULATE_PYTHON="/Users/juliasilge/miniforge3/envs/keras-connect/bin/python")

library(workflows)
library(parsnip)
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)

set.seed(1)
rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_umap(all_predictors(), outcome = vars(mpg), num_comp = 2)
mod <- workflow(rec, linear_reg()) |> fit(data = mtcars)

bundle::bundle(mod)
#> Error in `purrr::map()` at bundle/R/bundle_recipe.R:22:2:
#> ℹ In index: 1.
#> Caused by error in `all_nn_indices_are_loaded()`:
#> ! Invalid model: has no 'nn_index'
#> Backtrace:
#>      ▆
#>   1. ├─bundle::bundle(mod)
#>   2. ├─bundle:::bundle.workflow(mod) at bundle/R/bundle.R:20:2
#>   3. │ └─bundle::swap_element(res, "pre", "actions", "recipe", "recipe") at bundle/R/bundle_workflows.R:55:2
#>   4. │   ├─bundle::bundle(component) at bundle/R/utils.R:64:4
#>   5. │   └─bundle:::bundle.recipe(component) at bundle/R/bundle.R:20:2
#>   6. │     └─purrr::map(res$steps, bundle::bundle) at bundle/R/bundle_recipe.R:22:2
#>   7. │       └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
#>   8. │         ├─purrr:::with_indexed_errors(...)
#>   9. │         │ └─base::withCallingHandlers(...)
#>  10. │         ├─purrr:::call_with_cleanup(...)
#>  11. │         ├─bundle (local) .f(.x[[i]], ...)
#>  12. │         └─bundle:::bundle.step_umap(.x[[i]], ...) at bundle/R/bundle.R:20:2
#>  13. │           └─uwot::save_uwot(umap_fit, file_loc) at bundle/R/bundle_embed.R:42:2
#>  14. │             └─uwot:::all_nn_indices_are_loaded(model)
#>  15. │               └─base::stop("Invalid model: has no 'nn_index'")
#>  16. └─base::.handleSimpleError(...)
#>  17.   └─purrr (local) h(simpleError(msg, call))
#>  18.     └─cli::cli_abort(...)
#>  19.       └─rlang::abort(...)

Created on 2023-08-21 with reprex v2.0.2

juliasilge commented 1 year ago

We can still bundle the prepped recipe, which seems weird:

Sys.setenv(RETICULATE_PYTHON="/Users/juliasilge/miniforge3/envs/keras-connect/bin/python")

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)

rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_umap(all_predictors(), outcome = vars(mpg), num_comp = 2) |> 
  prep()

bundle::bundle(rec)
#> bundled recipe object.

Created on 2023-08-21 with reprex v2.0.2

simonpcouch commented 1 year ago

A couple more pieces of the puzzle from spending some time with this this afternoon:

That uwot:::all_nn_indices_are_loaded() helper is just looking for a nn_index slot inside of step$object. In hardhat:::mold_recipe_default_process(), the prepped recipe is dropped inside of the blueprint rather than the recipe action. So:

mod$pre$actions$recipe$recipe$steps[[1]]$object
#> NULL

mod$pre$mold$blueprint$recipe$steps[[1]]$object
#> redacted for brevity...
#> 
#> $n_neighbors
#> [1] 15
#> 
#> $nn_index
#> $nn_index$ann
#> C++ object <0x11e50d9e0> of class 'AnnoyEuclidean' <0x12fca5f60>
#> 
#> $nn_index$type
#> [1] "annoyv1"
#> 
#> $nn_index$metric
#> [1] "euclidean"
#>
#> ...redacted for brevity

I think the former is supposed to contain the unprepped recipe while the latter is supposed to contain the prepped one? If so, I think removing

https://github.com/rstudio/bundle/blob/1498c8822c2d07bce6ee534b3fef8b9bbd78e22a/R/bundle_workflows.R#L55

and

https://github.com/rstudio/bundle/blob/1498c8822c2d07bce6ee534b3fef8b9bbd78e22a/R/bundle_workflows.R#L62

would resolve the issue, as we shouldn't need to worry about serialization of unprepped steps. A PR suggesting that change is incoming. :)

juliasilge commented 1 year ago

we shouldn't need to worry about serialization of unprepped steps

YES, makes sense 👍