stevenpawley / recipeselectors

Additional recipes for supervised feature selection to be used with the tidymodels recipes package
https://stevenpawley.github.io/recipeselectors/

Issues when tuning parameters from 3 different sources #4

Closed · lg1000 closed this issue 2 years ago

lg1000 commented 2 years ago

As the reprex below shows, I run into problems when tuning model arguments and recipe arguments (from both recipes and recipeselectors) together by merging the grids. I tried numerous ways, but always get the error message:

preprocessor 3/3: Error: You cannot prep() a tuneable recipe. Argument(s) with tune(): 'top_p'. Do you want to use a tuning function such as tune_grid()?

If I tune all the model and recipe arguments except top_p, everything works fine. How can I understand this issue?
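
For context, the message is recipes' generic guard: any recipe that still contains tune() placeholders refuses prep() and has to go through a tuning function instead. A minimal illustration with a toy recipe, unrelated to my data:

library(tidymodels)

# a recipe holding a tune() placeholder cannot be prepped directly
rec_toy <- recipe(mpg ~ ., data = mtcars) %>%
    step_corr(all_predictors(), threshold = tune())

prep(rec_toy) # errors: "You cannot prep() a tuneable recipe. ..."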

#### LIBS

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(recipeselectors))

#### DATA

df <- fread("Churn_Modelling.csv") # source: https://www.kaggle.com/shrutimechlearn/churn-modelling

set.seed(31)

split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

k_folds_data <- vfold_cv(training(split), v = 10)

#### FEATURES 

# Define the recipe for Up-Sampling
rec <- recipe(Exited ~ ., data = train) %>%
    step_rm(one_of("RowNumber", "Surname")) %>%
    update_role(CustomerId, new_role = "Helper") %>%
    step_num2factor(all_outcomes(),
                    levels = c("No", "Yes"),
                    transform = function(x) {x + 1}) %>%
    step_normalize(all_numeric(), -has_role(match = "Helper")) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>%
    step_nzv(all_predictors()) %>%
    themis::step_upsample(Exited) %>%
    step_other(all_nominal(), threshold = tune("cat_thresh")) %>% 
    step_corr(all_predictors(), threshold = tune("thresh_cor")) %>% 
    #step_pca(all_numeric(), -all_outcomes(), num_comp = tune())
    step_select_roc(all_predictors(), outcome = "Exited", top_p = tune())

#### MODEL

model_metrics <- metric_set(roc_auc)            

# xgboost model
xgb_spec <- boost_tree(
    trees = tune(), 
    tree_depth = tune(), min_n = tune(), 
    loss_reduction = tune(),                    
    sample_size = tune(), mtry = tune(),         
    learn_rate = tune(),                        
    stop_iter = tune()
) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification")

# grid
xgb_grid <- grid_latin_hypercube(
    trees(),
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), train),
    learn_rate(),
    stop_iter(range = c(5L,50L)),
    size = 10
)

rec_grid <- grid_latin_hypercube(
    parameters(rec) %>%
        update(top_p = top_p(c(1, 11))),
    size = 10
)

# merge() on data frames with no shared columns returns their Cartesian
# product, so comp_grid holds 10 x 10 = 100 candidate parameter sets
comp_grid <- merge(xgb_grid, rec_grid)

# tune
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
model_res <- tune_grid(xgb_spec, preprocessor = rec,
                       resamples = k_folds_data,
                       grid = comp_grid,
                       metrics = model_metrics)
stopCluster(cl)
stevenpawley commented 2 years ago

Hello, unfortunately I don't have a lot of time to look into the details right now, but a quick run of your code (not in parallel) raised a few issues:

1. top_p appears to be trying to select more predictors than remain after your earlier recipe steps, particularly step_corr. Your code ran fine for me once I omitted step_corr.
2. I don't think this is the cause of the error, but step_other should come before step_dummy; otherwise no factor variables are left to pool, because they have all already been converted to dummy variables.
3. The recipeselectors package may not be exported to the cluster when running in parallel, so try exporting it explicitly with the pkgs argument of control_grid, as sketched below.
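
A minimal sketch of that last suggestion, reusing the objects from your reprex:

# export recipeselectors to the parallel workers so that
# step_select_roc() is available during resampling
ctrl <- control_grid(pkgs = "recipeselectors")

set.seed(234)
model_res <- tune_grid(xgb_spec, preprocessor = rec,
                       resamples = k_folds_data,
                       grid = comp_grid,
                       metrics = model_metrics,
                       control = ctrl)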

lg1000 commented 2 years ago

Thanks a lot! I omitted step_corr and used the pkgs argument as you proposed, and now it works. Next I will try to achieve the same with the finetune package.
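
A minimal sketch of what I have in mind, assuming tune_race_anova() mirrors the tune_grid() interface (finetune's control_race() also takes a pkgs argument):

library(finetune)

race_res <- tune_race_anova(xgb_spec, preprocessor = rec,
                            resamples = k_folds_data,
                            grid = comp_grid,
                            metrics = model_metrics,
                            control = control_race(pkgs = "recipeselectors"))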