stevenpawley / colino

Recipes Steps for Supervised Filter-Based Feature Selection
https://stevenpawley.github.io/colino/

colino

The goal of colino is to provide supervised feature selection steps to be used with the tidymodels recipes package. The overall focus of the package is on filter-based feature selection methods. Permutation score methods that use a model can be considered a special case of filter approaches.

Note - colino is the new package name and replaces the preliminary 'recipeselectors' name. Colino will be submitted to CRAN once some additional steps and documentation have been finalized.

Installation

devtools::install_github("stevenpawley/colino")

Feature Selection Methods

The following feature selection methods are implemented:

Feature Selection Criteria

Three parameters are used to filter features within the step_select_ functions in colino: top_p, threshold and cutoff.

Note that top_p and threshold are mutually exclusive, but either can be used in conjunction with cutoff to select the top-ranked features together with any features whose filter scores meet the cutoff. For example, you can require at least three features to be included by using top_p = 3 but also include any other features that meet the cutoff criterion, e.g., cutoff = 0.01 if a method uses p-value units.
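The way top_p and cutoff combine can be sketched in base R. This is only an illustration of the selection logic described above, not colino's internal code: select_features() is a hypothetical helper, the maximize argument is an assumption used to say whether larger or smaller scores are better, and the p-values are made up for the example.

```r
# Hypothetical helper, not part of colino: keeps the union of the
# top-ranked features and the features whose score meets the cutoff.
select_features <- function(scores, top_p = NULL, cutoff = NULL,
                            maximize = TRUE) {
  # feature names ordered from best to worst score
  ranked <- names(scores)[order(scores, decreasing = maximize)]
  keep <- character(0)

  # always keep the top_p best-ranked features
  if (!is.null(top_p)) {
    keep <- union(keep, head(ranked, top_p))
  }

  # additionally keep every feature whose score meets the cutoff
  if (!is.null(cutoff)) {
    meets <- if (maximize) scores >= cutoff else scores <= cutoff
    keep <- union(keep, names(scores)[meets])
  }

  # return the retained features in ranked order
  ranked[ranked %in% keep]
}

# made-up p-values for six features (smaller is better)
p_values <- c(a = 0.20, b = 0.001, c = 0.03, d = 0.005, e = 0.0001, f = 0.002)

selected <- select_features(p_values, top_p = 3, cutoff = 0.01,
                            maximize = FALSE)
print(selected)
```

Here the three best-ranked features are always retained, and feature d is added because its p-value also meets the cutoff.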

Most step_select_ steps have top_p, threshold and cutoff available, but a few methods such as Boruta and FCBF do not rank the features and only provide a list of rejected features. These methods typically have either none of these arguments or only cutoff.

Notes

The step_select_vip step is designed to work with the parsnip package and requires a base model specification that provides a method of ranking the importance of features, such as feature importance scores or coefficients, with one score per feature. The base model is specified in the step using the model parameter.

Although step_select_vip allows a diverse range of models to be used as the ranking algorithm, and potentially allows new models to be implemented, a limitation is that the hyperparameters of the ranking model cannot be tuned. As an alternative, step_select_linear, step_select_tree and step_select_forests provide steps specific to these types of models, where the hyperparameters of the ranking model can be tuned using the same tuning arguments as parsnip.

The parsnip package does not currently contain a method of pulling feature importance scores from models that support them. The colino package provides a generic function pull_importances for this purpose that accepts a fitted parsnip model, and returns a tibble with two columns 'feature' and 'importance':

library(parsnip)
library(colino)

model <- boost_tree(mode = "classification") %>%
  set_engine("xgboost")

model_fit <- model %>%
  fit(Species ~ ., data = iris)

pull_importances(model_fit)

Most of the models and 'engines' that provide feature importances are implemented. In addition, h2o models are supported using the agua package. Use methods(pull_importances) to list the models that are currently implemented. If you need to pull feature importance scores from a model that is not currently supported in this package, you can add a method to the pull_importances generic function that returns a two-column tibble:

pull_importances._ranger <- function(object, scaled = FALSE, ...) {
  scores <- ranger::importance(object$fit)

  # create a tibble with 'feature' and 'importance' columns
  scores <- tibble::tibble(
    feature = names(scores),
    importance = as.numeric(scores)
  )

  # optionally rescale the importance scores
  if (scaled)
    scores$importance <- scales::rescale(scores$importance)
  scores
}
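The dispatch mechanism behind such a method can be tried without colino installed. The sketch below uses a stand-in generic, pull_importances_demo, which is a hypothetical name introduced only for this illustration, and a method for base R lm objects that treats absolute coefficients as importance scores. That scoring choice is an assumption made for brevity, since raw coefficients depend on the scale of each predictor.

```r
# Stand-in generic for illustration only; a real method would be added
# to colino's own pull_importances generic instead.
pull_importances_demo <- function(object, scaled = FALSE, ...) {
  UseMethod("pull_importances_demo")
}

# Method for base R linear models: absolute coefficients as importance.
pull_importances_demo.lm <- function(object, scaled = FALSE, ...) {
  coefs <- stats::coef(object)
  coefs <- coefs[names(coefs) != "(Intercept)"]

  # two-column table with 'feature' and 'importance' columns
  scores <- data.frame(
    feature = names(coefs),
    importance = abs(unname(coefs)),
    stringsAsFactors = FALSE
  )

  # optionally rescale the importance scores to the range [0, 1]
  if (scaled) {
    rng <- range(scores$importance)
    scores$importance <- (scores$importance - rng[1]) / (rng[2] - rng[1])
  }
  scores
}

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
imp <- pull_importances_demo(fit, scaled = TRUE)
print(imp)
```

Calling the generic on the fitted object dispatches on its class, which is the same mechanism that lets a new pull_importances method be picked up for an unsupported model type.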

An example of using the step_select_vip function:

library(parsnip)
library(recipes)
library(magrittr)
library(colino)

# load the example iris dataset
data(iris)

# define a base model to use for feature importances
base_model <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "permutation")

# create a preprocessing recipe
rec <- iris %>%
  recipe(Species ~ .) %>%
  step_select_vip(all_predictors(), model = base_model, top_p = 2,
                  outcome = "Species")

prepped <- prep(rec)

# create a model specification
clf <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

clf_fitted <- clf %>%
  fit(Species ~ ., juice(prepped))