tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

step_corr fails to remove high correlation column #1355

Closed grouptheory closed 3 months ago

grouptheory commented 3 months ago

Minimal example:

library(recipes)  # also attaches dplyr, which provides mutate() and %>%

df <- data.frame(x1=runif(10)) %>% 
  mutate(x2=x1+1) %>% 
  mutate(y=x1+rnorm(10))

cor(df)

rec <- recipe(y~x1+x2, data = df) %>%
  step_corr(threshold=0.9) %>%
  prep(df)

bake(rec, new_data=df)

EmilHvitfeldt commented 3 months ago

Cross-posted from https://stackoverflow.com/questions/78834269/tidymodels-step-corr-fails-to-remove-highly-correlated-columns

You forgot to select variables in step_corr(). All steps allow empty selections, in which case the step does nothing.

library(recipes)

df <- data.frame(x1=runif(10)) %>% 
  mutate(x2=x1+1) %>% 
  mutate(y=x1+rnorm(10))

cor(df)
#>           x1        x2         y
#> x1 1.0000000 1.0000000 0.6882089
#> x2 1.0000000 1.0000000 0.6882089
#> y  0.6882089 0.6882089 1.0000000

rec <- recipe(y~x1+x2, data = df) %>%
  step_corr(all_predictors(), threshold=0.9) %>%
  prep(df)

bake(rec, new_data=df)
#> # A tibble: 10 × 2
#>       x2      y
#>    <dbl>  <dbl>
#>  1  1.06 -0.353
#>  2  1.53 -0.951
#>  3  1.87  2.51 
#>  4  1.43 -0.288
#>  5  1.60  0.696
#>  6  1.64  0.296
#>  7  1.31  1.16 
#>  8  1.07 -1.37 
#>  9  1.49 -0.215
#> 10  1.70  1.16

Created on 2024-08-05 with reprex v2.1.0
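To see the difference between the two calls side by side, the following is a minimal sketch that preps one recipe with an empty selection and one with all_predictors(), then compares the columns that survive bake(). The object names (rec_empty, rec_sel, kept_empty, kept_sel, removed) and the seed are illustrative choices, not part of the original report, and the sketch assumes a reasonably recent version of recipes.

```r
library(recipes)

set.seed(1)  # arbitrary seed so the illustration is repeatable

# Same structure as the example above: x2 is a shifted copy of x1
df <- data.frame(x1 = runif(10))
df$x2 <- df$x1 + 1
df$y <- df$x1 + rnorm(10)

# Empty selection: the step has no columns to examine, so nothing is removed
rec_empty <- recipe(y ~ x1 + x2, data = df) %>%
  step_corr(threshold = 0.9) %>%
  prep(df)

# Explicit selector: the step compares the predictors and drops one of the pair
rec_sel <- recipe(y ~ x1 + x2, data = df) %>%
  step_corr(all_predictors(), threshold = 0.9) %>%
  prep(df)

kept_empty <- names(bake(rec_empty, new_data = df))
kept_sel <- names(bake(rec_sel, new_data = df))

# tidy() on a prepped step_corr() lists the columns that the step removed
removed <- tidy(rec_sel, number = 1)$terms

print(kept_empty)
print(kept_sel)
print(removed)
```

Calling tidy() on the prepped recipe is a quick way to check what a filter step actually did without inspecting the baked data by eye.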
