topepo opened 1 year ago
Here's a straw man constructor for new methods:
```r
new_filter_method <- function(name, label, goal = "maximize",
                              inputs = "all", outputs = "all", pkgs) {
  # name: a keyword used in other steps (e.g. rf_imp or similar)
  # label: for printing ("random forest variable importance")
  goal <- rlang::arg_match0(goal, c("maximize", "minimize", "zero", "target"))

  # Specifications for the input and output variables.
  # Maybe these should be more specific (e.g. "factor", "numeric", etc.).
  # Should we also specify max levels for factor inputs or outputs?
  inputs <- rlang::arg_match0(inputs, c("all", "qualitative", "quantitative"))
  outputs <- rlang::arg_match0(outputs, c("all", "qualitative", "quantitative"))

  # pkgs: character vector of external packages used to compute the filter
  # maybe also set default arguments and a list that can't be altered by the user?
  res <-
    list(
      name = name,
      label = label,
      goal = goal,
      inputs = inputs,
      outputs = outputs,
      pkgs = pkgs
    )
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}
```
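As a quick usage sketch (not part of the proposal itself): constructing a method object and checking its dispatch class might look like the following. Here base `match.arg` stands in for `rlang::arg_match0` so the example only needs base R, and the `pROC` package name is purely illustrative.

```r
# Minimal stand-in for the constructor above, using base match.arg
new_filter_method <- function(name, label, goal = "maximize",
                              inputs = "all", outputs = "all", pkgs = character()) {
  goal    <- match.arg(goal, c("maximize", "minimize", "zero", "target"))
  inputs  <- match.arg(inputs, c("all", "qualitative", "quantitative"))
  outputs <- match.arg(outputs, c("all", "qualitative", "quantitative"))
  res <- list(name = name, label = label, goal = goal,
              inputs = inputs, outputs = outputs, pkgs = pkgs)
  class(res) <- c(paste0("filter_method_", outputs), "filter_method")
  res
}

# A hypothetical ROC AUC filter: quantitative predictors, qualitative outcome
roc_method <- new_filter_method(
  name = "roc_auc",
  label = "ROC AUC filter",
  goal = "maximize",
  inputs = "quantitative",
  outputs = "qualitative",
  pkgs = "pROC"
)
class(roc_method)
#> [1] "filter_method_qualitative" "filter_method"
```

The outcome-type-specific class (`filter_method_qualitative` here) is what would let S3 dispatch pick classification- versus regression-oriented scoring code.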
I had a long train ride and did a draft implementation (in the topepo/colino fork) for a few methods:
```r
library(tidymodels)
library(colino) # remotes::install_github("topepo/colino")

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

data(cells)
cells$case <- NULL

fit_xy(
  colino:::filter_roc_auc,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 fiber_width_ch_1             0.833
#>  2 total_inten_ch_2             0.805
#>  3 total_inten_ch_1             0.790
#>  4 shape_p_2_a_ch_1             0.786
#>  5 avg_inten_ch_2               0.777
#>  6 convex_hull_area_ratio_ch_1  0.772
#>  7 avg_inten_ch_1               0.760
#>  8 entropy_inten_ch_1           0.759
#>  9 convex_hull_perim_ratio_ch_1 0.747
#> 10 var_inten_ch_1               0.727
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_mrmr,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                       score
#>    <chr>                          <dbl>
#>  1 total_inten_ch_4              0.644
#>  2 entropy_inten_ch_1           -0.0736
#>  3 avg_inten_ch_2               -0.0740
#>  4 skew_inten_ch_4              -0.0754
#>  5 convex_hull_perim_ratio_ch_1 -0.0761
#>  6 shape_bfr_ch_1               -0.0764
#>  7 inten_cooc_contrast_ch_3     -0.0772
#>  8 eq_sphere_vol_ch_1           -0.0779
#>  9 spot_fiber_count_ch_4        -0.0783
#> 10 diff_inten_density_ch_1      -0.0801
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                      score
#>    <chr>                         <dbl>
#>  1 total_inten_ch_2             0.189
#>  2 fiber_width_ch_1             0.174
#>  3 avg_inten_ch_2               0.137
#>  4 shape_p_2_a_ch_1             0.130
#>  5 total_inten_ch_1             0.130
#>  6 convex_hull_area_ratio_ch_1  0.112
#>  7 avg_inten_ch_1               0.109
#>  8 entropy_inten_ch_1           0.103
#>  9 skew_inten_ch_1              0.0922
#> 10 convex_hull_perim_ratio_ch_1 0.0898
#> # ℹ 46 more rows

fit_xy(
  colino:::filter_info_gain_ratio,
  x = cells %>% select(-class),
  y = cells %>% select(class)
)
#> # A tibble: 56 × 2
#>    variable                     score
#>    <chr>                        <dbl>
#>  1 total_inten_ch_2            0.158
#>  2 fiber_width_ch_1            0.126
#>  3 avg_inten_ch_2              0.106
#>  4 total_inten_ch_1            0.0982
#>  5 shape_p_2_a_ch_1            0.0978
#>  6 convex_hull_area_ratio_ch_1 0.0855
#>  7 avg_inten_ch_1              0.0828
#>  8 fiber_length_ch_1           0.0823
#>  9 skew_inten_ch_1             0.0767
#> 10 entropy_inten_ch_1          0.0754
#> # ℹ 46 more rows

data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)

num_col <- c("Longitude", "Latitude", "Year_Built", "Lot_Area", "Gr_Liv_Area")
fac_col <- c("MS_Zoning", "Central_Air", "Neighborhood")

fit_xy(
  colino:::filter_corr,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Gr_Liv_Area 0.696
#> 2 Year_Built  0.615
#> 3 Longitude   0.292
#> 4 Latitude    0.286
#> 5 Lot_Area    0.255

fit_xy(
  colino:::filter_max_diff,
  x = ames %>% select(all_of(fac_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 3 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 MS_Zoning    0.814
#> 2 Neighborhood 0.531
#> 3 Central_Air  0.262

fit_xy(
  colino:::filter_rf_imp,
  x = ames %>% select(all_of(c(fac_col, num_col))),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 8 × 2
#>   variable     score
#>   <chr>        <dbl>
#> 1 Gr_Liv_Area  18.4
#> 2 Year_Built   15.4
#> 3 Longitude     7.86
#> 4 Latitude      6.25
#> 5 Lot_Area      5.53
#> 6 Central_Air   4.05
#> 7 Neighborhood  3.97
#> 8 MS_Zoning     3.91

fit_xy(
  colino:::filter_mic,
  x = ames %>% select(all_of(num_col)),
  y = ames %>% select(Sale_Price)
)
#> # A tibble: 5 × 2
#>   variable    score
#>   <chr>       <dbl>
#> 1 Longitude   0.463
#> 2 Gr_Liv_Area 0.441
#> 3 Year_Built  0.436
#> 4 Latitude    0.420
#> 5 Lot_Area    0.234
```
Created on 2023-07-10 with reprex v2.0.2
Hi Max, many thanks for this - I wish my train rides were as productive! I've just started working through this - actually, I wasn't aware of your desirability2 package; I'll definitely be looking at it, particularly for those MRMR types of cases.
Back to the filter/ranking methods - I can definitely add these (and the remaining filter methods) so that the `fit_xy()` generic can be used on any supplied filter, which maybe gets built into something like a `step_filter_supervised`, for example.
Some other thoughts/ramblings are:

- Should the filter's parameters be tunable, e.g. `mtry` in a `rf_imp` filter? Most other ML libraries like sklearn or mlr3 allow tuning of almost everything, even if it creates some awkward syntax with those `pipeline__step__modelname__parameter` sorts of keys.
- Currently each `fit_xy` method is essentially reimplementing each recipe step. I guess I should look at reversing this, so that each specific recipe step, like `step_filter_infgain`, uses the `fit_xy` generic internally to avoid duplication.

> How to specify / supply arguments to the methods...

The underlying argument names are an open question for me. We can parameterize them for the individual filter functions. For a multi-method filter, I'm not sure of the best way to specify them.

> There is also the idea of considering the choice of filtering method as a hyperparameter, ...

That's a great idea.

> How to reuse some of those components...

I think that the `prep()` methods for the steps can call `fit_xy()`.

> Trying to think about how many steps this would be applicable to? ...

I would try to do all of them (within reason).
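To make the "`prep()` calls `fit_xy()`" idea concrete, here is a self-contained toy sketch. Everything in it is invented for illustration (the `fit_xy()` generic, the `filter_method_corr` class, and `prep_filter_step()` are stand-ins, not colino's actual API): the scoring logic lives in one `fit_xy()` method, and the step-like function only handles ranking and selection.

```r
# Toy sketch: each filter implements fit_xy() once, and a prep()-style
# function delegates to it instead of reimplementing the scoring logic.
fit_xy <- function(object, x, y, ...) UseMethod("fit_xy")

# Hypothetical correlation filter method (a stand-in, not colino's)
fit_xy.filter_method_corr <- function(object, x, y, ...) {
  scores <- vapply(x, function(col) abs(cor(col, y)), numeric(1))
  data.frame(variable = names(scores), score = unname(scores))
}

# A prep()-like step function that calls the shared generic internally,
# then keeps the top_n highest-scoring predictors
prep_filter_step <- function(method, x, y, top_n = 2) {
  res <- fit_xy(method, x, y)
  res <- res[order(res$score, decreasing = TRUE), ]
  head(res$variable, top_n)
}

corr_method <- structure(list(name = "corr"), class = "filter_method_corr")
set.seed(1)
dat <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
outcome <- 2 * dat$a + rnorm(50, sd = 0.1)
prep_filter_step(corr_method, dat, outcome, top_n = 1)
#> [1] "a"
```

The point of this shape is that a new filter only needs a `fit_xy()` method; every step built on top of the generic gets it for free.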
I think that we can use this in different packages too. I plan on adding a recursive feature engineering function (maybe to finetune) and these would be useful.
> I wish my train rides were as productive!

On the train ride back, I put these into a side-package that we could all use: https://github.com/topepo/filterdb
Some of these methods are based on what you first added so I planned on adding you as a contributor (if you want that).
I added some open questions/TODOs in the package too. I'll convert these to issues this week.
Can we design some common infrastructure across filter methods? In other words, the underlying filter methods have properties describing their inputs (e.g. the types of variables allowed), their outputs (such as minimize/maximize), and so on. This is not unlike how yardstick organizes performance metrics.
One of the goals is to be able to make composite filters (e.g. maximize ROC AUC and pick the three largest importance scores). I have a private package that I've been kicking around for a while (ironically called colander - I'll send you an invite) that was a prototype for these types of filters.
If we had modular methods with controlled filter names, we could also reduce the total number of steps and have new ones that are flexible and work across all filter methods:
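For illustration, here is what a method-agnostic scoring function could look like, where the filter is selected by a controlled name (and that name could itself be a tuning parameter). All function and method names here are hypothetical, not the package's API, and only two trivial filters are wired in:

```r
# One scoring entry point; the filter method is chosen by name, so the
# same step could expose "method" as a tunable argument.
score_variables <- function(method = c("corr", "variance"), x, y) {
  method <- match.arg(method)
  scores <- switch(
    method,
    corr     = vapply(x, function(col) abs(cor(col, y)), numeric(1)),
    variance = vapply(x, var, numeric(1))
  )
  data.frame(variable = names(x), score = unname(scores))
}

set.seed(2)
dat <- data.frame(a = rnorm(30), b = rnorm(30, sd = 5))
y <- dat$a + rnorm(30, sd = 0.2)

# The same interface dispatches to different filters:
corr_top <- with(score_variables("corr", dat, y), variable[which.max(score)])
var_top  <- with(score_variables("variance", dat, y), variable[which.max(score)])
c(corr_top, var_top)
#> [1] "a" "b"
```

A single `step_filter_supervised(method = ...)`-style interface built on this pattern would replace one step per filter with one step overall.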
I also have some working code to use desirability functions:
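As a rough idea of what that looks like, here is a hand-rolled sketch of desirability-based ranking (the desirability2 package provides a polished version of these ideas; the scores and cutoffs below are made up):

```r
# d_max() maps a score to [0, 1], rewarding larger values; the overall
# desirability is the geometric mean across criteria.
d_max <- function(x, low, high) pmin(pmax((x - low) / (high - low), 0), 1)

# Two hypothetical per-variable scores to combine:
roc_auc <- c(x1 = 0.83, x2 = 0.61, x3 = 0.75)
imp     <- c(x1 = 18.0, x2 = 2.0,  x3 = 9.5)

d_roc   <- d_max(roc_auc, low = 0.5, high = 0.9)
d_imp   <- d_max(imp, low = 0, high = 20)
overall <- sqrt(d_roc * d_imp)  # geometric mean of the two desirabilities

names(sort(overall, decreasing = TRUE))
#> [1] "x1" "x3" "x2"
```

This is what makes composite filters like "maximize ROC AUC *and* importance" reduce to ranking a single blended score.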
The organizational parts in colander are not all that great right now but I think that the idea is a good one.