tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

persist local variables used in recipe #751

Closed SlowMo24 closed 3 years ago

SlowMo24 commented 3 years ago

When a recipe uses a local variable (e.g. for cleaner code) that is no longer available when the data is processed, bake() will fail:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

input<-data.frame(x=c("foo","bar"),y=c(1,2))
search_str<-"foo"
model_recipe <- recipe(y~.,data=input)%>%
  step_mutate_at(x,fn=
                   list(
                     bool=~grepl(search_str,.,fixed=TRUE)
                   )
  )
trained_recipe<-model_recipe%>%
  prep()

trained_recipe%>%
  bake(new_data=NULL)
#> # A tibble: 2 x 3
#>   x         y x_bool
#>   <fct> <dbl> <lgl> 
#> 1 foo       1 TRUE  
#> 2 bar       2 FALSE

remove(search_str)

input_two<-data.frame(x=c("foo","bar"),y=c(1,2))

trained_recipe%>%
  bake(new_data=input_two)
#> Error: Problem with `mutate()` column `x_bool`.
#> ℹ `x_bool = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
#> x object 'search_str' not found

I see no advantage in this flexibility, since recipes is about consistency. Local variables whose format or content may differ between prep() and bake() can lead to undesired side effects. Wouldn't it be better to 'codify' the variables' content in the trained recipe for maximum transferability?

Variable injection during bake() could even be offered as a new feature, if desired.

EmilHvitfeldt commented 3 years ago

In this example, you can avoid the issue by defining the function you pass to step_mutate_at() up front instead of writing it in place. The error occurs because you technically didn't pass search_str to the step; you used it inside an anonymous function, which then looks it up in the environment when it is called. Defining the functions completely and passing them in will serve you better.

library(dplyr)
library(recipes)
input<-data.frame(x=c("foo","bar"),y=c(1,2))

foo_fun <- function(x) {
  grepl("foo",x,fixed=TRUE)
}
model_recipe <- recipe(y~.,data=input)%>%
  step_mutate_at(x,fn=
                   list(
                     bool=foo_fun
                   )
  )
trained_recipe<-model_recipe%>%
  prep()

trained_recipe%>%
  bake(new_data=NULL)
#> # A tibble: 2 x 3
#>   x         y x_bool
#>   <fct> <dbl> <lgl> 
#> 1 foo       1 TRUE  
#> 2 bar       2 FALSE

input_two<-data.frame(x=c("foo","bar"),y=c(1,2))

trained_recipe%>%
  bake(new_data=input_two)
#> # A tibble: 2 x 3
#>   x         y x_bool
#>   <fct> <dbl> <lgl> 
#> 1 foo       1 TRUE  
#> 2 bar       2 FALSE

Created on 2021-07-16 by the reprex package (v2.0.0)
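
The workaround above hard-codes "foo" inside foo_fun. If the pattern has to stay configurable, one possible variation is a function factory that captures the value when the function is created. The following is a minimal base R sketch of that idea, not something recipes provides: `make_matcher` is a hypothetical helper name, and the comparison with `late_lookup` only illustrates the scoping difference behind the original error.

```r
# Late lookup: the function body refers to `search_str`, which R resolves
# only when the function is called, not when it is defined.
search_str <- "foo"
late_lookup <- function(x) grepl(search_str, x, fixed = TRUE)

# Hypothetical helper (not part of recipes): a function factory that stores
# the pattern's value in the returned function's own environment.
make_matcher <- function(pattern) {
  force(pattern)  # evaluate the argument now instead of lazily
  function(x) grepl(pattern, x, fixed = TRUE)
}
captured <- make_matcher(search_str)

# Simulate the later session in which the local variable no longer exists.
rm(search_str)

x <- c("foo", "bar")

# The late-lookup version now fails; keep the error message instead of stopping.
late_result <- tryCatch(late_lookup(x), error = function(e) conditionMessage(e))

# The captured version still works because it carries its own copy of the pattern.
captured_result <- captured(x)

print(late_result)      # an "object not found" message (wording depends on locale)
print(captured_result)  # TRUE FALSE
```

Assuming step_mutate_at() treats such a closure like any other function, `make_matcher(search_str)` should be usable in place of foo_fun in the fn list. The call to force() matters here: without it, lazy evaluation would postpone reading search_str until the first call of the returned function.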

juliasilge commented 3 years ago

You can use quasiquotation to embed variables into a prepped recipe for many steps, such as step_mutate():

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

input <- data.frame(x=c("foo","bar"),y=c(1,2))
search_str <- "foo"

model_recipe <- recipe(y ~ ., data = input) %>%
  step_mutate(bool = grepl(!!search_str, x, fixed=TRUE))

recipe_prep <- prep(model_recipe)
recipe_prep %>% bake(new_data = NULL)
#> # A tibble: 2 x 3
#>   x         y bool 
#>   <fct> <dbl> <lgl>
#> 1 foo       1 TRUE 
#> 2 bar       2 FALSE

remove(search_str)

input_two <- data.frame(x=c("foo","bar"),y=c(1,2))
recipe_prep %>% bake(new_data = input_two)
#> # A tibble: 2 x 3
#>   x         y bool 
#>   <fct> <dbl> <lgl>
#> 1 foo       1 TRUE 
#> 2 bar       2 FALSE

Created on 2021-07-16 by the reprex package (v2.0.0)

This may be a way for you to meet your analysis needs as well.
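
To see why the quasiquotation approach survives removing search_str, it can help to look at the expression itself. The sketch below uses rlang::expr() outside of any recipe, on the assumption that step_mutate() captures its arguments in a comparable way; the names `inlined` and `by_name` are illustrative only.

```r
library(rlang)

search_str <- "foo"

# With `!!`, the current value of search_str is spliced into the expression.
inlined <- expr(grepl(!!search_str, x, fixed = TRUE))

# Without `!!`, the expression keeps a reference to the symbol search_str.
by_name <- expr(grepl(search_str, x, fixed = TRUE))

print(inlined)  # the pattern appears as the literal string "foo"
print(by_name)  # the pattern appears as the symbol search_str

# Variables each expression still depends on.
inlined_vars <- all.vars(inlined)
by_name_vars <- all.vars(by_name)

# Once the variable is gone, only the inlined expression can still be evaluated.
rm(search_str)
result <- eval(inlined, list(x = c("foo", "bar")))
print(result)  # TRUE FALSE
```

The inlined expression depends only on the column x, which is why the prepped recipe no longer needs the local variable at bake() time.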

SlowMo24 commented 3 years ago

Great, thank you for your answers! I don't see any need for further action on this issue.
