stevenpawley / recipeselectors

Additional recipes for supervised feature selection to be used with the tidymodels recipes package
https://stevenpawley.github.io/recipeselectors/

general thoughts on feature selection in tidymodels #1

Open topepo opened 4 years ago

topepo commented 4 years ago

I'm debating where to put supervised feature selection in tidymodels. Should it live inside recipes? I'll brainstorm out loud here; pardon my unsolicited ramblings.

pro-recipes pov:

con-recipes pov:

Originally I had thought up a filter specification (sort of like a recipe) that would define statistics (e.g. p-values, summary stats like ROC, model importance) and then rules to combine them (e.g. "ROC > .8 or being in the top 3 RF importance scores"). This would get included in a workflow and executed accordingly.

This method would be very expressive, but it would be yet another specification for users to fill out. That's why I haven't worked on it further.
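To make the "statistics plus rules" idea concrete, here is a minimal base R sketch. It is only an illustration under stated assumptions: the `auc_score()` helper, the `scores` table, and the choice of `mtcars` with `am` as the outcome are all hypothetical, and a t-test p-value stands in for the RF importance score so that no extra packages are needed. It is not an existing tidymodels interface.

```r
# Hypothetical sketch: none of these helpers exist in tidymodels.
# Outcome: transmission type (am) in the built-in mtcars data.
dat <- mtcars
outcome <- factor(dat$am, levels = c(0, 1))
predictors <- dat[, setdiff(names(dat), "am")]

# Area under the ROC curve for one numeric predictor, computed from ranks
# (Mann-Whitney form) and folded so that the direction does not matter.
auc_score <- function(x, y) {
  pos <- y == levels(y)[2]
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  auc <- (sum(rank(x)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)
}

# Statistics: one row per predictor.
scores <- data.frame(
  term = names(predictors),
  roc = vapply(predictors, auc_score, numeric(1), y = outcome),
  p_value = vapply(
    predictors,
    function(x) t.test(x ~ outcome)$p.value,
    numeric(1)
  ),
  row.names = NULL
)

# Rule: keep a predictor if ROC > 0.8 or it has one of the 3 smallest p-values.
keep <- scores$roc > 0.8 | rank(scores$p_value, ties.method = "min") <= 3
selected <- scores$term[keep]
selected
```

The statistics and the rule are kept separate on purpose: the table of scores could be computed once and several different rules tried against it.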

Switching gears, here are some specific design thoughts for this package:

stevenpawley commented 4 years ago

Hi Max,

Thanks so much for sharing your thoughts on issues related to the greater architecture. I thought of the recipes approach because (a) I had some quick feature selection needs related to project work, and (b) I also use scikit-learn regularly, where feature selection is performed similarly via transformers/pipelines. However, I'm not aware of a more sophisticated approach in sklearn for tuning or selecting between multiple transformers, other than allowing each transformer to select all of the features as one of its hyperparameter settings. I think that mlr and mlr3 allow this.

Really appreciate the package-specific comments. I thought that target and num_comp were not the best terms but I didn't get a chance to look at alternatives. I actually started using vip over the last week and realized that it already implements most of the methods to extract model-based importances so I'll definitely switch to that.

Right now, step_importance (which needs a better name) is effectively trying to mimic sklearn's SelectFromModel transformer, which takes another model. However, transformers and estimators have quite similar structures in scikit-learn, and both are easily tuned/updated by setting their instance attributes, including accessing estimator objects that are nested inside of them. Right now I need a feature selection step using model-based scores, but I can certainly drop that in time if a better structure is available.
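For readers less familiar with SelectFromModel, the general idea (fit a model, score each predictor, keep the top few) can be sketched in base R as below. This is not how step_importance is implemented: `select_from_model()` is a hypothetical helper, the importance score is just the absolute z-statistic from a logistic regression, the data are simulated, and all predictors are assumed to be numeric.

```r
# Hypothetical helper, not part of recipeselectors or vip.
# Fits a logistic regression and returns the names of the `top_p`
# predictors with the largest absolute z-statistics.
select_from_model <- function(data, outcome, top_p) {
  form <- reformulate(".", response = outcome)
  fit <- glm(form, data = data, family = binomial())
  # Drop the intercept row; keep a matrix so that names survive.
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
  importance <- setNames(abs(coefs[, "z value"]), rownames(coefs))
  importance <- sort(importance, decreasing = TRUE)
  names(importance)[seq_len(min(top_p, length(importance)))]
}

# Simulated data (an assumption of this sketch): only x1 and x2 carry signal.
set.seed(123)
n <- 200
sim <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
prob <- plogis(2 * sim$x1 - 2 * sim$x2)
sim$class <- factor(ifelse(runif(n) < prob, "yes", "no"))

top_terms <- select_from_model(sim, outcome = "class", top_p = 2)
top_terms
```

Here `top_p` plays the role of the tunable hyperparameter; setting it to the number of predictors keeps everything, which corresponds to the "select all of the features" setting mentioned above.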

My aims for the package were mostly to implement some of the common filter-based methods. However, as it's unlikely to address the other issues, are you open to taking ~5 more recipes within the recipes package, or would you prefer to keep additions such as these in a separate package (e.g. like themis)?

topepo commented 4 years ago

Right now, recipes is a good enough place to put these operations so go for it.

I think that keeping the filter steps in a separate package is a good idea. There might be a lot of steps.

I forked your repo and just committed some prototype code for an ROC filter (I kinda had this lying around). Here's an example of usage that emulates RFE:

```r
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.4      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.6
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.1
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0
#> ✓ infer     0.5.1      ✓ workflows 0.1.0
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.5
#> ✓ purrr     0.3.4
#> Warning: package 'rsample' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> ── Conflicts ────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()
library(recipeselectors)

data(cells, package = "modeldata")

cells <- cells %>% select(-case)

rec <-
  recipe(class ~ ., data = cells) %>%
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_select_roc(all_predictors(), outcome = "class", top_p = tune())

mod <- logistic_reg() %>%
  set_engine("glm")

wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

p_info <-
  wflow %>%
  parameters() %>%
  update(top_p = top_p(c(1, 30)))

rs <- vfold_cv(cells)

ctrl <- control_grid(extract = identity)

rfe_res <-
  mod %>%
  tune_grid(
    rec,
    resamples = rs,
    grid = 20,
    param_info = p_info,
    control = ctrl
  )

rfe_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(x = top_p, y = mean)) +
  geom_point() +
  geom_line() +
  theme_bw()
```

Created on 2020-05-06 by the reprex package (v0.3.0)
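For anyone who wants to see what tuning `top_p` amounts to without installing the fork, the following base R sketch mimics the same loop by hand. The filter is re-estimated on the analysis portion of each fold, a logistic regression is fit on the retained predictors, and the held-out AUC is averaged per `top_p`. Everything here is an assumption made for illustration (simulated data, the `auc()` and `eval_top_p()` helpers, 5 folds); it is not the prototype's code.

```r
set.seed(2020)
n <- 300
p <- 8
x <- as.data.frame(matrix(rnorm(n * p), ncol = p))
names(x) <- paste0("x", seq_len(p))
# Only x1 to x3 carry signal (an assumption of this simulation).
lin <- 1.5 * x$x1 - 1 * x$x2 + 0.5 * x$x3
dat <- cbind(x, class = rbinom(n, 1, plogis(lin)))

# Rank-based (Mann-Whitney) area under the ROC curve for a 0/1 outcome.
auc <- function(score, y) {
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(rank(score)[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Random assignment of rows to 5 folds.
folds <- sample(rep(1:5, length.out = n))

# Mean held-out AUC when only the `top_p` best-ranked predictors are kept.
eval_top_p <- function(top_p) {
  fold_auc <- vapply(1:5, function(k) {
    train <- dat[folds != k, ]
    test <- dat[folds == k, ]
    # The filter sees the analysis set only, never the held-out rows.
    filt <- vapply(train[names(x)], function(col) {
      a <- auc(col, train$class)
      max(a, 1 - a)
    }, numeric(1))
    keep <- names(sort(filt, decreasing = TRUE))[seq_len(top_p)]
    fit <- glm(reformulate(keep, response = "class"),
               data = train, family = binomial())
    auc(predict(fit, newdata = test), test$class)
  }, numeric(1))
  mean(fold_auc)
}

res <- data.frame(top_p = seq_len(p))
res$mean_auc <- vapply(res$top_p, eval_top_p, numeric(1))
res
```

Keeping the filter inside the resampling loop is the point of the exercise: ranking the predictors once on the full data and then resampling would let the held-out rows influence which predictors survive.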

Some notable things:

Let me know what you think.