Closed SantiagoD999 closed 7 months ago
Hello @SantiagoD999 👋
At prediction time, {workflows} passes only the variables marked as "predictor" to the preprocessor (for example, a recipe), so a step that targets the outcome has nothing to act on. If you really want to scale the outcome, you can scale it separately beforehand.
library(tidymodels)

set.seed(1)

# Simulate four standard-normal predictors and a linear outcome with noise
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
e <- rnorm(n)
y <- 2000 + 3 * x1 + 8 * x2 + 9 * x3 + 8 * x4 + e

DATA <- tibble(y, x1, x2, x3, x4)
TRAIN <- DATA[1:800, ]
TEST <- DATA[801:n, ]

# Scale the outcome separately, using statistics estimated on TRAIN only
recipe_outcome <- recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_outcomes()) %>%
  prep()

TRAIN <- bake(recipe_outcome, TRAIN)
# This is only needed because we know TEST has an outcome.
# On future data this will not be needed
TEST <- bake(recipe_outcome, TEST)

# Predictors are normalized inside the workflow
recipe_norm <- recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_predictors())

mlp_norm <- workflow() %>%
  add_model(mlp(epochs = tune()) %>% set_engine("nnet") %>% set_mode("regression")) %>%
  add_recipe(recipe_norm)

mlp_resample <- vfold_cv(TRAIN, v = 5)
mlp_tune <- tune_grid(
  mlp_norm,
  mlp_resample,
  grid = 5,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(rmse)
)

# Naming `metric` works across tune versions; newer releases may reject the unnamed form
mlp_norm_fit <- mlp_norm |>
  finalize_workflow(select_best(mlp_tune, metric = "rmse")) |>
  fit(TRAIN)

augment(mlp_norm_fit, new_data = TEST)
#> # A tibble: 200 × 6
#> x1 x2 x3 x4 y .pred
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.09 0.714 -2.10 -0.728 -1.49 -1.38
#> 2 -1.83 0.581 -0.0844 -0.247 -0.149 -0.219
#> 3 0.995 -0.147 0.756 -0.614 0.208 0.227
#> 4 -0.0119 1.51 -1.58 0.104 0.0199 -0.0561
#> 5 -0.600 -0.280 0.707 -0.801 -0.277 -0.257
#> 6 -0.178 2.03 -1.05 1.32 1.08 1.07
#> 7 -0.426 -1.20 0.259 0.0312 -0.474 -0.518
#> 8 0.997 1.31 -0.00168 -0.824 0.499 0.436
#> 9 0.728 -0.524 -1.18 -0.867 -1.16 -1.24
#> 10 -1.73 0.354 1.74 -1.25 0.230 0.188
#> # ℹ 190 more rows
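Because the outcome was normalized before modelling, `.pred` is in standardized units. One possible way to map predictions back to the original scale is to read the stored mean and standard deviation from the prepped recipe with `tidy()`. The sketch below is self-contained: the data frame `toy` and the stand-in predictions `pred_scaled` are made up for illustration and are not the objects from the example.

```r
library(recipes)

set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 2000 + 3 * toy$x + rnorm(50)

# Prep a recipe that normalizes only the outcome
rec <- recipe(y ~ ., data = toy) |>
  step_normalize(all_outcomes()) |>
  prep()

# tidy() on step 1 returns one row per variable and statistic ("mean", "sd")
stats <- tidy(rec, number = 1)
y_mean <- stats$value[stats$terms == "y" & stats$statistic == "mean"]
y_sd <- stats$value[stats$terms == "y" & stats$statistic == "sd"]

scaled <- bake(rec, new_data = toy)

# Stand-in for model predictions made on the scaled outcome
pred_scaled <- scaled$y

# Undo the normalization: multiply by the sd, then add the mean
pred_original <- pred_scaled * y_sd + y_mean
all.equal(pred_original, toy$y)
```

With real predictions, the same two constants would be applied to the `.pred` column instead of `pred_scaled`.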
Thank you for your response. I have another question: wouldn't scaling the outcome beforehand cause data leakage during resampling, since the standardization would use the outcome from the entire training set rather than only the data available in each fold?
Good question. I don't think it would. Data leakage is when information about the relationship between the outcome and the predictors is handled improperly, so that the model appears to perform better than it actually does.
Shifting and dividing the outcome by the same numbers in all parts of the pipeline doesn't do that. Imagine adding 10 to the outcome: it should have no effect on the performance.
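This can be checked for a simple case. The sketch below uses simulated data and an ordinary linear model (`lm()`) rather than the neural network from the thread, and all variable names are made up for illustration. For a linear model the predictions are the same after mapping back to the original units; a model such as a neural network can be sensitive to the outcome's scale during fitting, so an exact match is not guaranteed there.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2000 + 3 * x + rnorm(n)

# Fixed constants used everywhere in the pipeline (here: the mean and sd of y)
y_mean <- mean(y)
y_sd <- sd(y)
y_scaled <- (y - y_mean) / y_sd

fit_raw <- lm(y ~ x)
fit_scaled <- lm(y_scaled ~ x)

pred_raw <- predict(fit_raw)
# Map the scaled predictions back to the original units
pred_back <- predict(fit_scaled) * y_sd + y_mean

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Compare the two sets of predictions (up to floating-point error)
all.equal(unname(pred_raw), unname(pred_back))

# RMSE on the scaled outcome versus the original RMSE divided by y_sd
c(rmse(y, pred_raw) / y_sd, rmse(y_scaled, predict(fit_scaled)))
```

The rescaling changes the units in which RMSE is reported, not how well the model ranks against alternatives evaluated on the same scaled outcome.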
I see your point, thank you for the clarification.
No problem!
Good morning, I am using your excellent tidymodels package and have come across an issue. Thank you.
The problem
I'm having trouble using recipes when the goal is to scale both the features and the target and then choose a model after tuning a parameter. I want to use the scaled target both for tuning and for subsequent prediction. I have used skip=TRUE, but this makes the tuning process ignore that I want the target to be scaled.
Reproducible example