stevenpawley / recipeselectors

Additional recipes for supervised feature selection to be used with the tidymodels recipes package
https://stevenpawley.github.io/recipeselectors/

Which feature selectors for regression to use? #9

Closed gundalav closed 2 years ago

gundalav commented 2 years ago

Hi Steven,

Thank you so much for making this great package!

Among the selectors that you list here, which ones can be used for regression (i.e., where the target outcome is numeric)?

I've tried step_select_mrmr() with this data.

> toxic_feat_outcome_dat

# A tibble: 30 × 13
   toxic_outcome         foo_energy     charge     boman     hmoment
           <dbl>              <dbl>      <dbl>     <dbl>       <dbl>
 1         0.570              -750.      0.943      1.61       0.641
 2         0.626              -750.      6.09       5.30       0.278
 3         1.49              -1120.      6.99       2.49       0.461
 4         2.15               -938.      9.09       3.29       0.623
 5         1.04               -927.      3.12       2.66       0.469
 6         1.57              -1272.      9.00       5.73       0.604
 7         1.99              -1094.      4.57       4.33       0.329
 8         1.24               -933.      2.94       2.65       0.339
 9         1.40              -1076.      6.12       2.87       0.469
10         1.20              -1002.      4.94       3.48       0.427
# … with 20 more rows, and 8 more variables: hydrophobicity <dbl>,
#   insta <dbl>, length <dbl>, masshift <dbl>, mw <dbl>,
#   mz <dbl>, pi <dbl>, PEP <dbl>

It works for me. But I'm not sure if it's appropriate.

mrmr_rec <- recipe(toxic_outcome ~ ., data = toxic_feat_outcome_dat ) %>%
  step_select_mrmr(all_predictors(), outcome = "toxic_outcome", threads = 2,  
                   top_p = dim(toxic_feat_outcome_dat)[1], threshold = 0.9)

The reason I ask is that in your example the outcome, class, is categorical.

library(recipes)
library(recipeselectors)  # provides step_select_mrmr()
data(cells, package = "modeldata")
rec <- recipe(class ~ ., data = cells[, -1]) %>%
  step_select_mrmr(all_predictors(), outcome = "class", top_p = 10, threshold = 0.9)

Thanks and hope to hear from you again.

Sincerely, G.V.

stevenpawley commented 2 years ago

Overall, mutual-information-based methods, including mRMR, are intended for classification models. However, these steps use the praznik package, which does permit their application to regression models: it bins a numerical outcome into several equally sized categories. I think there is some debate over the appropriateness of this approach, since some information must obviously be lost when binning.
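
To make the information loss concrete, here is a minimal base R sketch of binning a numeric outcome. It is not praznik's internal code: the bin count `n_bins` is a hypothetical value chosen for illustration, and `cut()` with equal-width intervals is an assumption about what "equally sized" means here. The values are the first ten `toxic_outcome` entries printed in the question.

```r
# First ten outcome values from the tibble printed in the question.
toxic_outcome <- c(0.570, 0.626, 1.49, 2.15, 1.04, 1.57, 1.99, 1.24, 1.40, 1.20)

# Hypothetical bin count, chosen only for illustration (not praznik's rule).
n_bins <- 3

# Given a single number for `breaks`, cut() splits the observed range
# into that many equal-width intervals and returns a factor.
binned <- cut(toxic_outcome, breaks = n_bins)

# Side-by-side view: each numeric value is now represented only by its bin.
print(data.frame(toxic_outcome, binned))

# Counts per bin. Values that differ on the numeric scale but fall in the
# same bin are indistinguishable to a method that only sees the factor.
print(table(binned))
```

After binning, the selector effectively treats the problem as classification over the bin labels, so any variation in the outcome within a bin cannot influence which features are ranked highest.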