stevenpawley / recipeselectors

Additional recipes for supervised feature selection to be used with the tidymodels recipes package
https://stevenpawley.github.io/recipeselectors/

step_select_vip & dummy variables #10

Closed cgoo4 closed 2 years ago

cgoo4 commented 2 years ago

Using step_dummy followed by step_select_vip on all_predictors returns the top_p predictors plus the dummy variables. Is it possible to have the dummy variables counted within the top_p instead?

stevenpawley commented 2 years ago

Hello, sorry for the delayed response, I am only just getting back to open-source work. If I understand correctly, you want step_select_vip to be applied to the categorical variable as a whole (before it is encoded), not to the individual categories that have been transformed into separate binary variables?

If this is correct, then a filter method based on variable importance is problematic: if a model requires categorical variables to be encoded, the feature importance scores will always be computed for the individual dummy variables rather than for the original categorical variable.

I can only provide some info on how I handle this when using other libraries, for example scikit-learn or mlr3. For these, I would use a wrapper method in which the selection is based on permutation importance. The one-hot encoding is wrapped into a pipeline together with the learner, and this pipeline goes inside the permutation method, so that the permutation scores represent the individual variables prior to one-hot encoding.
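To make the wrapper idea concrete, here is a minimal, dependency-free Python sketch. Everything in it is an assumption made for illustration: the toy data, the `encode` helper, the `EncodedNearestNeighbour` class (a stand-in for a pipeline of encoder plus learner) and the hand-written `permutation_importance` function are hypothetical and are not part of recipeselectors, scikit-learn or mlr3. In scikit-learn, the equivalent would be passing a fitted `Pipeline` that contains the encoder to `sklearn.inspection.permutation_importance`. The point it shows is that the shuffling happens on the raw columns, so one score comes back per original variable.

```python
import random

COLUMNS = ["color", "size"]          # raw (pre-encoding) variables
LEVELS = ["blue", "green", "red"]    # categories of `color`

# Made-up toy data: the target depends only on `color`.
X = [["red", 0.1], ["green", 0.2], ["blue", 0.3],
     ["red", 0.4], ["green", 0.5], ["blue", 0.0]]
y = [10.0, 20.0, 30.0, 10.0, 20.0, 30.0]


def encode(row):
    """One-hot encode `color` into three dummies; pass `size` through."""
    color, size = row
    return [1.0 if color == level else 0.0 for level in LEVELS] + [size]


class EncodedNearestNeighbour:
    """Pipeline stand-in: encoding and a 1-nearest-neighbour learner as one unit."""

    def fit(self, rows, targets):
        self.train = [encode(r) for r in rows]
        self.targets = list(targets)
        return self

    def predict(self, rows):
        preds = []
        for row in rows:
            z = encode(row)  # encoding happens inside the wrapped model
            dists = [sum((a - b) ** 2 for a, b in zip(z, t)) for t in self.train]
            preds.append(self.targets[dists.index(min(dists))])
        return preds


def mse(model, rows, targets):
    """Mean squared error of the model on raw (unencoded) rows."""
    preds = model.predict(rows)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Mean increase in MSE when one raw column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    scores = {}
    for j, name in enumerate(COLUMNS):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[j] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        scores[name] = sum(increases) / n_repeats
    return scores


model = EncodedNearestNeighbour().fit(X, y)
importance = permutation_importance(model, X, y)
print(importance)  # one score per raw variable, none per dummy column
```

Because the dummy columns only ever exist inside `predict`, a top_p style selection on `importance` would keep or drop `color` as a whole rather than its individual categories.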