stevenpawley / colino

Recipes Steps for Supervised Filter-Based Feature Selection
https://stevenpawley.github.io/colino/

organizing the filter/ranking methods #8

Open topepo opened 12 months ago

topepo commented 12 months ago

Can we design some common infrastructure across filter methods? In other words, each underlying filtering method has properties describing its inputs (e.g., the types of variables allowed), its outputs, and whether its score should be minimized or maximized, and so on. This is not unlike how yardstick organizes performance metrics.
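For reference, each yardstick metric already carries this kind of metadata as attributes, e.g. its optimization direction:

library(yardstick)

attr(roc_auc, "direction")
#> [1] "maximize"
attr(rmse, "direction")
#> [1] "minimize"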

One of the goals is to be able to make composite filters (e.g. maximize ROC AUC and pick the three largest importance scores). I have a private package that I've been kicking around for a while (ironically called colander - I'll send you an invite) that was a prototype for these types of filters.

If we had modular methods with consistent filter names, we could also reduce the total number of steps and add new ones that are flexible and work across all filter methods:

step_rank_predictors(
    all_predictors(),
    method = "rf_imp",
    top_p = 5
  )

# or 

step_filter_predictors(
    all_predictors(),
    filter = rf_imp > 3 & roc_auc >= .8
  )

I also have some working code to use desirability functions:

step_rank_desirability(
    all_predictors(),
    eqn = d_max(rf_imp, 0, 5) + d_min(pval_anova, -10, -1, scale = 1/2),
    top_p = 2
  )

The organizational parts in colander are not all that great right now but I think that the idea is a good one.

topepo commented 12 months ago

Here's a straw man constructor for new methods:

new_filter_method <- function(name, label, goal = "maximize", 
                              inputs = "all", outputs = "all", pkgs) {
  # name: a keyword used in other steps (e.g. rf_imp or similar)
  # label: for printing ("random forest variable importance")

  goal <- rlang::arg_match0(goal, c("maximize", "minimize", "zero", "target"))

  # Specifications for the input (predictor) and output (outcome) variable types
  # Maybe these should be more specific (e.g. "factor", "numeric", etc). 
  # Should also specify max levels for factor inputs or outputs? 
  inputs  <- rlang::arg_match0(inputs,  c("all", "qualitative", "quantitative"))
  outputs <- rlang::arg_match0(outputs, c("all", "qualitative", "quantitative"))

  # pkgs: character vector of external packages used to compute the filter

  # maybe also set default arguments and a list that can't be altered by the user? 
  res <- 
    list(
      name = name,
      label = label,
      goal = goal,
      inputs = inputs,
      outputs = outputs,
      pkgs = pkgs
    )
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}
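For example, a method might be registered like this (the values are purely illustrative):

# illustrative only; the name/label/pkgs values are placeholders
filter_roc_auc <-
  new_filter_method(
    name = "roc_auc",
    label = "area under the ROC curve",
    goal = "maximize",
    inputs = "quantitative",
    outputs = "qualitative",
    pkgs = "pROC"
  )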

topepo commented 12 months ago

I had a long train ride and did a draft implementation (in the topepo/colino fork) for a few methods:

library(tidymodels)
library(colino)     # remotes::install_github("topepo/colino")

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

data(cells)
cells$case <- NULL

fit_xy(
  colino:::filter_roc_auc,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 fiber_width_ch_1             0.833
#>  2 total_inten_ch_2             0.805
#>  3 total_inten_ch_1             0.790
#>  4 shape_p_2_a_ch_1             0.786
#>  5 avg_inten_ch_2               0.777
#>  6 convex_hull_area_ratio_ch_1  0.772
#>  7 avg_inten_ch_1               0.760
#>  8 entropy_inten_ch_1           0.759
#>  9 convex_hull_perim_ratio_ch_1 0.747
#> 10 var_inten_ch_1               0.727
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_mrmr,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                       score
#>    <chr>                          <dbl>
#>  1 total_inten_ch_4              0.644 
#>  2 entropy_inten_ch_1           -0.0736
#>  3 avg_inten_ch_2               -0.0740
#>  4 skew_inten_ch_4              -0.0754
#>  5 convex_hull_perim_ratio_ch_1 -0.0761
#>  6 shape_bfr_ch_1               -0.0764
#>  7 inten_cooc_contrast_ch_3     -0.0772
#>  8 eq_sphere_vol_ch_1           -0.0779
#>  9 spot_fiber_count_ch_4        -0.0783
#> 10 diff_inten_density_ch_1      -0.0801
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                      score
#>    <chr>                         <dbl>
#>  1 total_inten_ch_2             0.189 
#>  2 fiber_width_ch_1             0.174 
#>  3 avg_inten_ch_2               0.137 
#>  4 shape_p_2_a_ch_1             0.130 
#>  5 total_inten_ch_1             0.130 
#>  6 convex_hull_area_ratio_ch_1  0.112 
#>  7 avg_inten_ch_1               0.109 
#>  8 entropy_inten_ch_1           0.103 
#>  9 skew_inten_ch_1              0.0922
#> 10 convex_hull_perim_ratio_ch_1 0.0898
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain_ratio,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 total_inten_ch_2            0.158 
#>  2 fiber_width_ch_1            0.126 
#>  3 avg_inten_ch_2              0.106 
#>  4 total_inten_ch_1            0.0982
#>  5 shape_p_2_a_ch_1            0.0978
#>  6 convex_hull_area_ratio_ch_1 0.0855
#>  7 avg_inten_ch_1              0.0828
#>  8 fiber_length_ch_1           0.0823
#>  9 skew_inten_ch_1             0.0767
#> 10 entropy_inten_ch_1          0.0754
#> # ℹ 46 more rows

data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)

num_col <- c("Longitude", "Latitude", "Year_Built", "Lot_Area", "Gr_Liv_Area")
fac_col <- c("MS_Zoning", "Central_Air", "Neighborhood")

fit_xy(
  colino:::filter_corr,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Gr_Liv_Area 0.696
#> 2 Year_Built  0.615
#> 3 Longitude   0.292
#> 4 Latitude    0.286
#> 5 Lot_Area    0.255

fit_xy(
  colino:::filter_max_diff,
  x = ames %>% select(all_of(fac_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 3 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 MS_Zoning    0.814
#> 2 Neighborhood 0.531
#> 3 Central_Air  0.262

fit_xy(
  colino:::filter_rf_imp,
  x = ames %>% select(all_of(c(fac_col, num_col))),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 8 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 Gr_Liv_Area  18.4 
#> 2 Year_Built   15.4 
#> 3 Longitude     7.86
#> 4 Latitude      6.25
#> 5 Lot_Area      5.53
#> 6 Central_Air   4.05
#> 7 Neighborhood  3.97
#> 8 MS_Zoning     3.91

fit_xy(
  colino:::filter_mic,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Longitude   0.463
#> 2 Gr_Liv_Area 0.441
#> 3 Year_Built  0.436
#> 4 Latitude    0.420
#> 5 Lot_Area    0.234

Created on 2023-07-10 with reprex v2.0.2

stevenpawley commented 11 months ago

Hi Max, many thanks for this - I wish my train rides were as productive! I've just started working through this - actually, I wasn't aware of your desirability2 package - I'll definitely be looking at that, particularly for those MRMR types of cases.

Back to the filter/ranking methods - I can definitely add these (and the remaining filter methods), so that the fit_xy generic can be used on any supplied filter, which maybe gets built into something like a step_filter_supervised, for example.
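Purely as a sketch of what that could look like (the step name and arguments are hypothetical at this point):

# hypothetical sketch; the step and its arguments don't exist yet
recipe(class ~ ., data = cells) %>%
  step_filter_supervised(
    all_predictors(),
    method = "info_gain",  # or a filter object that fit_xy() understands
    top_p = 10
  )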

Some other thoughts/ramblings are:

  1. How to specify / supply arguments to the methods in the same way as when they are called in their recipe steps, for example, mtry in a rf_imp filter? Most other ML libraries like sklearn or mlr3 allow tuning of almost everything, even if it creates some awkward syntax with those 'pipeline__step__modelname__parameter' sort of keys. (One possible shape is sketched at the end of this comment.)
  2. There is also the idea of considering the choice of filtering method as a hyperparameter, i.e., choosing rf_imp vs. something else during tuning, but I guess that is a completely different issue and currently that could be performed via a workflowset (although more computationally expensive).
  3. How to reuse some of those components - currently each fit_xy method essentially reimplements its recipe step. I guess I should look at reversing this, so that each specific recipe step, like step_filter_infgain, uses the fit_xy generic internally to avoid duplication.
  4. Trying to think about how many steps this would be applicable to? For example, are there methods that don't make sense to use with this approach, maybe for MRMR or Boruta? Or maybe that's fine - it is for the user to decide.
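Regarding 1., one possible shape (just a sketch):

# just a sketch; `options` and its contents are hypothetical
step_rank_predictors(
  all_predictors(),
  method = "rf_imp",
  options = list(mtry = 3, trees = 500),
  top_p = tune()
)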
topepo commented 11 months ago

How to specify / supply arguments to the methods...

The underlying argument names are an open question for me. We can parameterize them for the individual filter functions. For a multi-method filter, I'm not sure of the best way to specify them.

There is also the idea of considering the choice of filtering method as a hyperparameter,...

That's a great idea.
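Until there is direct support, a rough approximation could be a workflow set over recipes that differ only in the filter step; a sketch using the proposed (not yet existing) step_rank_predictors:

# rough sketch; step_rank_predictors() is the proposed step from above
library(tidymodels)

rec_roc <- recipe(class ~ ., data = cells) %>%
  step_rank_predictors(all_predictors(), method = "roc_auc", top_p = 10)
rec_rf  <- recipe(class ~ ., data = cells) %>%
  step_rank_predictors(all_predictors(), method = "rf_imp", top_p = 10)

workflow_set(
  preproc = list(roc = rec_roc, rf = rec_rf),
  models  = list(logistic = logistic_reg())
)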

How to reuse some of those components...

I think that the prep() methods for the steps can call fit_xy().
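Schematically (not actual colino code; the outcome field and the simplified return are placeholders):

# schematic only: the outcome field and the step constructor details are elided
prep.step_filter_infgain <- function(x, training, info = NULL, ...) {
  col_names <- recipes::recipes_eval_select(x$terms, training, info)
  scores <- fit_xy(
    filter_info_gain,
    x = training[, col_names, drop = FALSE],
    y = training[, x$outcome, drop = FALSE]  # hypothetical field name
  )
  top <- scores$variable[seq_len(min(x$top_p, nrow(scores)))]
  x$removals <- setdiff(col_names, top)  # a real method would rebuild the step object
  x
}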

Trying to think about how many steps this would be applicable to?...

I would try to do all of them (within reason).

I think that we can use this in different packages too. I plan on adding a recursive feature engineering function (maybe to finetune) and these would be useful.

topepo commented 11 months ago

I wish my train rides were as productive!

On the train ride back, I put these into a side package that we could all use: https://github.com/topepo/filterdb

Some of these methods are based on what you first added so I planned on adding you as a contributor (if you want that).

I added some open questions/todo's in the package too. I'll convert these to issues this week.