general thoughts on feature selection in tidymodels

topepo commented 4 years ago

I'm debating where to include supervised feature selection inside of tidymodels. Should it live inside of recipes? I'll brainstorm out loud here; pardon my unsolicited ramblings.

pro-recipes pov:

Simple and already uses a specification that people know about (a recipe).
This lets the user define the pre-processing and filtering order. It is not obvious which should come first and may need some experimentation for each data set.

con-recipes pov:

Can't easily combine filters. For example, like with a volcano plot I might want to filter on statistical significance (e.g. p-value/FDR) and the size of a difference simultaneously. In this specific case, there would be a step that has these two criteria as arguments but, in general, more complex filter combinations would be difficult within a recipe.
We might have to repeat some computations (potentially a lot of them). Take a "select the best X predictors by ROC score" scenario. We'd like to loop over models for each value of X so that we don't repeat the recipe execution when it is not needed. We have something like this in parsnip (via multi_predict()) but it can't be done inside of a recipe. Imagine a complex text recipe that does stemming, tokenization, and a bunch of heavy computations before the filter step. For each value of X that we want to search over, those computations get repeated. (I mention below that this may be better solved in a specific function for RFE.)
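To make the volcano-plot case concrete, here is a minimal base R sketch of filtering on significance and effect size at the same time. The toy data, the helper name volcano_stats(), and the cutoffs (FDR < 0.05, absolute difference > 1) are illustrative assumptions, not part of recipes or any other package.

```r
# Toy data (made-up values): two classes, three numeric predictors.
dat <- data.frame(
  class = factor(rep(c("a", "b"), each = 5)),
  x1 = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),              # large, significant shift
  x2 = c(1, 2, 3, 4, 5, 1.1, 2.1, 3.1, 4.1, 5.1),         # small, non-significant shift
  x3 = c(1.00, 1.01, 1.02, 1.03, 1.04,
         1.10, 1.11, 1.12, 1.13, 1.14)                    # small but significant shift
)
predictors <- setdiff(names(dat), "class")

# Hypothetical helper: the two statistics a volcano plot shows for one predictor.
volcano_stats <- function(x, y) {
  tt <- t.test(x ~ y)
  c(
    p_value = tt$p.value,
    difference = unname(tt$estimate[2] - tt$estimate[1])
  )
}

stats <- as.data.frame(t(sapply(dat[predictors], volcano_stats, y = dat$class)))
stats$fdr <- p.adjust(stats$p_value, method = "BH")

# Both criteria must hold at once: significant AND a large enough difference.
keep <- rownames(stats)[stats$fdr < 0.05 & abs(stats$difference) > 1]
keep
# → "x1"
```

Here x3 passes the significance cutoff but not the effect-size cutoff, which is exactly the kind of feature a single-criterion filter would let through.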

Originally I had thought up a filter specification (sort of like a recipe) that would define statistics (e.g. p-values, summary stats like ROC, model importance) and then rules to combine them ("ROC > .8 or being in the top 3 RF importance scores"). This would get included in a workflow and executed accordingly.

This method would be very expressive but yet another specification for users to fill out. That's why I haven't worked on it further.
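As a rough illustration of the "statistics plus combining rules" idea, a rule can be stored as an unevaluated expression and applied to a table of per-feature scores. The scores and column names below are made up, and this is only a sketch of the concept in base R, not a proposed API.

```r
# Per-feature statistics (made-up values); in practice these would be
# computed from the training data.
scores <- data.frame(
  feature = c("a", "b", "c", "d", "e"),
  roc = c(0.91, 0.62, 0.85, 0.55, 0.70),
  rf_importance = c(10, 40, 5, 3, 20)
)

# The rule is kept separate from the statistics, as an unevaluated expression:
# "ROC > .8 or being in the top 3 RF importance scores".
rule <- quote(roc > 0.8 | rank(-rf_importance) <= 3)

# Evaluate the rule with the score columns in scope.
keep <- scores$feature[eval(rule, scores)]
keep
# → "a" "b" "c" "e"
```

Because the rule is just an expression over named statistics, "and", "or", and nested conditions come for free; the cost is that users have one more specification to write.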

Switching gears, here are some specific design thoughts for this package:

The package name is fairly general. Can you come up with something that is more about supervised feature selection (as opposed to just selection)?
Maybe we should have a specific naming convention for these steps (step_filter_*, step_select_* or something like that).
The filter steps might be parameterized so that the top n features are selected and/or via specific values (e.g. keep features with ROC values > 0.8 and the top 3 ROC features). This would avoid filtering out all of the features. Looking through some of the steps, you may already have that but, in some cases, one overrides the other. Maybe default both to NA so that users have to fill out at least one. Filling out both would be an effective "or" unless otherwise noted.
Alternatively, if a filter excludes everything, you might want an option to always take the top score (even if it sucks).
For importance scores, we've been using the vip package so maybe importing that would be a good idea. They have the wrappers worked out already. This would also offset the number of dependencies for your package.
I think that the steps should mostly be filter methods (instead of wrappers). Some wrappers/algorithms (like RFE) could be done via the functions in tune. For more complex algorithms, I think that we would want functions that take a model workflow as input. Maybe functions like search_rfe(), search_sa(), etc. For the sake of package size, it might make sense for those to live in a separate package.
Some existing recipes use an argument name of outcome for specifying the outcome column.
Also, in terms of argument names, top_n or something similar is more generic than num_comp.
If threshold is an option to filter the importance scores, there should also be the option to standardize the score range (0-1 maybe?).
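The selection semantics suggested in this list (both arguments defaulting to NA, "or" when both are given, an optional fallback to the single best score, and optional rescaling to 0-1) could look roughly like the following base R sketch. The function name select_by_score() and the rescale and keep_best arguments are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: `scores` is a named numeric vector where larger is better.
select_by_score <- function(scores, top_n = NA, threshold = NA,
                            rescale = FALSE, keep_best = TRUE) {
  if (is.na(top_n) && is.na(threshold)) {
    stop("Supply at least one of `top_n` or `threshold`.")
  }
  if (rescale) {
    # Standardize to [0, 1] so that `threshold` means the same thing across
    # different importance metrics (assumes the scores are not all equal).
    rng <- range(scores)
    scores <- (scores - rng[1]) / (rng[2] - rng[1])
  }
  keep <- rep(FALSE, length(scores))
  if (!is.na(top_n)) {
    keep <- keep | rank(-scores, ties.method = "first") <= top_n
  }
  if (!is.na(threshold)) {
    # Supplying both criteria acts as an "or".
    keep <- keep | scores > threshold
  }
  if (!any(keep) && keep_best) {
    # Nothing passed the filter: fall back to the single best score.
    keep[which.max(scores)] <- TRUE
  }
  names(scores)[keep]
}

roc <- c(x1 = 0.95, x2 = 0.85, x3 = 0.60, x4 = 0.55)
select_by_score(roc, top_n = 1, threshold = 0.8)   # → "x1" "x2"
select_by_score(roc, threshold = 0.99)             # → "x1" (fallback)
```

Whether the fallback should be on by default is a design choice; the sketch turns it on so that a step never returns zero predictors.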

stevenpawley commented 4 years ago

Hi Max,

Thanks so much for sharing your thoughts on issues related to the greater architecture. I thought of the recipes approach because of (a) some quick feature selection needs related to project work, and (b) I also use scikit-learn regularly, where feature selection is performed similarly via transformers/pipelines. However, I'm not aware of a more sophisticated approach in sklearn in terms of tuning/selections between multiple transformers, other than simply allowing each transformer to be able to select all of the features as one of the hyperparameters. I think that mlr and mlr3 allow this.

Really appreciate the package-specific comments. I thought that target and num_comp were not the best terms but I didn't get a chance to look at alternatives. I actually started using vip over the last week and realized that it already implements most of the methods to extract model-based importances so I'll definitely switch to that.

Right now, the step_importance (which needs a better name) was effectively trying to mimic sklearn's SelectFromModel transformer that takes another model. However, the structure of a transformer and estimator are quite similar in scikit-learn, and both are easily tuned/updated by setting their instance attributes including accessing estimator objects that are nested inside of them. Right now I need a feature selection step using model-based scores, but certainly can drop that in time if a better structure is available.

My aims for the package were mostly to implement some of the common filter-based methods. However, as it's unlikely to address the other issues, are you open to taking ~5 more recipes within the recipes package, or would you prefer to keep additions such as these in a separate package (e.g. like themis)?

topepo commented 4 years ago

Right now, recipes is a good enough place to put these operations so go for it.

I think that keeping the filter steps in a separate package is a good idea. There might be a lot of steps.

I forked your repo and just committed some prototype code for an ROC filter (I kinda had this lying around). Here's an example of usage that emulates RFE:

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.4      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.6 
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.0 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.5 
#> ✓ purrr     0.3.4
#> Warning: package 'rsample' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> ── Conflicts ────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()
library(recipeselectors)

data(cells, package = "modeldata")

cells <- cells %>% select(-case)

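# Remove highly correlated predictors first, then keep the top_p predictors
# ranked by ROC score; top_p is left as a tuning parameter.
rec <-
  recipe(class ~ ., data = cells) %>%
  step_corr(all_predictors(), threshold = 0.9) %>% 
  step_select_roc(all_predictors(), outcome = "class", top_p = tune())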

mod <- logistic_reg() %>% 
  set_engine("glm")

wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(mod)

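# Widen the search range for top_p to 1-30 predictors
p_info <- 
  wflow %>% 
  parameters() %>% 
  update(top_p = top_p(c(1, 30)))

rs <- vfold_cv(cells)

# extract = identity keeps the fitted objects from each fit for later inspection
ctrl <- control_grid(extract = identity)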

rfe_res <-
  mod %>% 
  tune_grid(
    rec,
    resamples = rs,
    grid = 20,
    param_info = p_info,
    control = ctrl
  )

rfe_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  ggplot(aes(x = top_p, y = mean)) + 
  geom_point() + 
  geom_line() + 
  theme_bw()

Created on 2020-05-06 by the reprex package (v0.3.0)

Some notable things:

I used top_p. That could use a better name. I defaulted the parameter grid to be [1, 4] for now. If you want larger p, the parameter set can be updated or a custom grid can be given.
I kept the list of predictors to remove. I think in your code you save the ones retained. The downside to that is that you have to get fancy with the select statement in bake() because, if you just save that list, you would exclude columns that were not involved in the filter step.
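The difference between storing the removed columns and storing the retained ones can be seen in a small base R sketch. The data frame and the variable names (filtered, retained, removed) are made up for illustration and do not reflect the actual step internals.

```r
dat <- data.frame(
  id = 1:3,                  # not part of the filter step
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 8, 9),
  class = c("a", "b", "a")   # outcome, not part of the filter step
)

filtered <- c("x1", "x2", "x3")   # columns the filter step was applied to
retained <- "x1"                  # what the filter decided to keep
removed <- setdiff(filtered, retained)

# Storing the removals: baking is a plain "drop these columns".
bake_removed <- dat[, !(names(dat) %in% removed), drop = FALSE]
names(bake_removed)
# → "id" "x1" "class"

# Storing only the retained names and selecting them directly also drops
# columns that were never involved in the filter.
bake_naive <- dat[, retained, drop = FALSE]
names(bake_naive)
# → "x1"

# To make the retained list work, the step also has to remember which
# columns it was applied to and drop only the difference.
bake_retained <- dat[, !(names(dat) %in% setdiff(filtered, retained)), drop = FALSE]
names(bake_retained)
# → "id" "x1" "class"
```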

You'll need some tunable() methods for the steps. I have an example in my commit.

Let me know what you think.

stevenpawley / recipeselectors

general thoughts on feature selection in tidymodels #1