Add a function to get "final" dataset passed into model training or `predict`

TylerGrantSmith commented 2 years ago

I often want to inspect or use the transformed data that is passed directly into a model fitting function (e.g. xgboost::xgboost), but there does not seem to be a defined way to get this 'final' transformed dataset from the workflow.

What I currently do is essentially emulate what happens in predict.workflow, but it seems like this should be encapsulated in its own exported function, because it requires forge_predictors, which is not currently exported.

library(tidymodels)

rec_spec <- recipe(mpg ~ ., data = mtcars)

xgb_spec <- boost_tree() %>% set_mode("regression")

xgb_wf <- workflow(rec_spec, xgb_spec)

xgb_fit <- fit(xgb_wf, data = mtcars)

# get matrix of data as it was passed into xgboost. 
# taken from workflows:::predict.workflow
xgb_matrix <- xgb_fit %>% 
  extract_fit_parsnip() %>% 
  prepare_data(workflows:::forge_predictors(mtcars, xgb_fit))

juliasilge commented 2 years ago

You can use extract_mold() to pull out the mold, like so:

library(tidymodels)

rec_spec <- recipe(Sepal.Length ~ ., data = iris) %>% step_dummy(all_nominal_predictors())
xgb_spec <- boost_tree() %>% set_mode("regression")
xgb_wf <- workflow(rec_spec, xgb_spec)
xgb_fit <- fit(xgb_wf, data = iris)

xgb_fit %>% extract_mold() %>% pluck("predictors")
#> # A tibble: 150 × 5
#>    Sepal.Width Petal.Length Petal.Width Species_versicolor Species_virginica
#>          <dbl>        <dbl>       <dbl>              <dbl>             <dbl>
#>  1         3.5          1.4         0.2                  0                 0
#>  2         3            1.4         0.2                  0                 0
#>  3         3.2          1.3         0.2                  0                 0
#>  4         3.1          1.5         0.2                  0                 0
#>  5         3.6          1.4         0.2                  0                 0
#>  6         3.9          1.7         0.4                  0                 0
#>  7         3.4          1.4         0.3                  0                 0
#>  8         3.4          1.5         0.2                  0                 0
#>  9         2.9          1.4         0.2                  0                 0
#> 10         3.1          1.5         0.1                  0                 0
#> # … with 140 more rows

Created on 2022-07-15 by the reprex package (v2.0.1)

Notice that we have dummy variables for Species, because these are the transformed predictors.

TylerGrantSmith commented 2 years ago

Thanks @juliasilge. It is a bit more verbose to apply the process to new data and complete the transformation, including the "interface" conversion that prepare_data provides. It seems like there should be an exported function that takes a fit object and data and returns a fully formed dataset ready for input into the model. Perhaps I am alone with this issue?

library(tidymodels)

rec_spec <- recipe(Sepal.Length ~ ., data = iris) %>% step_dummy(all_nominal_predictors())
xgb_spec <- boost_tree() %>% set_mode("regression")
xgb_wf <- workflow(rec_spec, xgb_spec)
xgb_fit <- fit(xgb_wf, data = iris)

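# forge the new data with the workflow's blueprint, then apply the
# parsnip "interface" conversion (a matrix, in the xgboost case)
form_data <- function(object, new_data) {
  fit_parsnip <- extract_fit_parsnip(object)
  prepare_data(fit_parsnip, forge_predictors(new_data, object))
}
# evaluate form_data() inside the workflows namespace so that the
# unexported forge_predictors() can be found
environment(form_data) <- getNamespace("workflows")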

xgb_fit %>% form_data(head(iris))
#>      Sepal.Width Petal.Length Petal.Width Species_versicolor Species_virginica
#> [1,]         3.5          1.4         0.2                  0                 0
#> [2,]         3.0          1.4         0.2                  0                 0
#> [3,]         3.2          1.3         0.2                  0                 0
#> [4,]         3.1          1.5         0.2                  0                 0
#> [5,]         3.6          1.4         0.2                  0                 0
#> [6,]         3.9          1.7         0.4                  0                 0

DavisVaughan commented 2 years ago

I think the function you are really looking for is hardhat::forge(). You can do it in two lines of code when combined with extract_mold().

I don't think we want to make this any easier because I think we want people to treat the preprocessing + model fitting as a single workflow, and this short circuits at the end of the preprocessing part (which isn't a bad thing for debugging, but isn't something we want super visible).

library(tidymodels)

rec_spec <- recipe(Sepal.Length ~ ., data = iris) %>% step_dummy(all_nominal_predictors())
xgb_spec <- boost_tree() %>% set_mode("regression")
xgb_wf <- workflow(rec_spec, xgb_spec)
xgb_fit <- fit(xgb_wf, data = iris)

xgb_mold <- extract_mold(xgb_fit)

hardhat::forge(
  new_data = head(iris), 
  blueprint = xgb_mold$blueprint
)
#> $predictors
#> # A tibble: 6 × 5
#>   Sepal.Width Petal.Length Petal.Width Species_versicolor Species_virginica
#>         <dbl>        <dbl>       <dbl>              <dbl>             <dbl>
#> 1         3.5          1.4         0.2                  0                 0
#> 2         3            1.4         0.2                  0                 0
#> 3         3.2          1.3         0.2                  0                 0
#> 4         3.1          1.5         0.2                  0                 0
#> 5         3.6          1.4         0.2                  0                 0
#> 6         3.9          1.7         0.4                  0                 0
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> $extras$roles
#> NULL

Created on 2022-07-20 by the reprex package (v2.0.1)
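
For readers who also need the "interface" conversion discussed earlier in the thread, the forge() approach can be combined with prepare_data() without reaching for unexported functions. The following is a minimal sketch, not part of workflows: final_predictors() is a hypothetical helper name, and it assumes prepare_data() is exported by parsnip, as the unqualified calls earlier in the thread suggest.

```r
library(tidymodels)

rec_spec <- recipe(Sepal.Length ~ ., data = iris) %>% step_dummy(all_nominal_predictors())
xgb_spec <- boost_tree() %>% set_mode("regression")
xgb_wf <- workflow(rec_spec, xgb_spec)
xgb_fit <- fit(xgb_wf, data = iris)

# Hypothetical helper (not a workflows function): returns new_data in the
# form that the underlying model function receives.
final_predictors <- function(wf_fit, new_data) {
  # Apply the trained preprocessing to new_data via the stored blueprint
  mold <- extract_mold(wf_fit)
  forged <- hardhat::forge(new_data, blueprint = mold$blueprint)

  # Apply the parsnip "interface" conversion
  # (assumption: prepare_data() is exported by parsnip)
  parsnip::prepare_data(extract_fit_parsnip(wf_fit), forged$predictors)
}

xgb_input <- final_predictors(xgb_fit, head(iris))
```

This remains a debugging aid rather than a replacement for predict(), which already performs both steps on a fitted workflow.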

tidymodels / workflows

Add a function to get "final" dataset passed into model training or `predict` #159