Open exalate-issue-sync[bot] opened 1 year ago
Megan Kurka commented: Questions to consider:
Michal Kurka commented: [~accountid:557058:3ae3c86a-e56a-4211-99d4-9a8cf5ab63f6] please add a link to the AutoML proposal that has different versions of FS algos
Nkululeko Thangelane commented: This would be a great feature to implement for H2O
JIRA Issue Migration Info
Jira Issue: PUBDEV-5264
Assignee: UNASSIGNED
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Add feature selection function in H2O that would remove any unhelpful or harmful features.
One method of performing feature selection is Boruta (described here: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/). This is different from simply removing the x variables with the lowest variable importance, because Boruta can also remove variables with high variable importance that are hurting the model.
For example, high-cardinality categorical columns often have very high variable importance in algorithms like GBM and Random Forest, yet they can actually hurt model performance. Boruta should be able to identify these types of variables as well as variables with little to no impact on the model.
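To make the Boruta idea concrete, here is a minimal sketch (not H2O code; it uses scikit-learn's `RandomForestClassifier` as a stand-in model, and the dataset, thresholds, and round count are illustrative assumptions): each real feature gets a shuffled "shadow" copy, a forest is fit on both, and a real feature survives only if it repeatedly beats the strongest shadow's importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Toy data: with shuffle=False the first 3 columns are informative,
# the remaining 3 are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=42)

def boruta_round(X, y, rng):
    # Shadow features: column-wise permutations of the real features,
    # so each shadow keeps its marginal distribution but loses any
    # relationship to the target.
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    augmented = np.hstack([X, shadows])
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(augmented, y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_max = rf.feature_importances_[n:].max()
    # A real feature "hits" when it outperforms the strongest shadow.
    return real_imp > shadow_max

# Repeat several rounds; keep features that hit in a majority of them.
hits = sum(boruta_round(X, y, rng) for _ in range(10))
selected = np.where(hits >= 6)[0]
print("selected feature indices:", selected)
```

Because the shadows are provably uninformative, this test can reject a feature with high raw importance (e.g. a high-cardinality categorical that mostly memorizes noise) if it cannot consistently beat its own shuffled copies.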
It may make sense to add this as a parameter for each algorithm (for example: feature_selection = True/False), with the feature selection method specific to the algorithm. For GBM and Random Forest, the feature selection method could be Boruta. For GLM, the feature selection method could be Lasso regularization.
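For the GLM case, a hedged sketch of the Lasso route (again using scikit-learn rather than H2O's GLM, with an illustrative dataset and an arbitrary alpha): L1 regularization drives some coefficients exactly to zero, and the zeroed features are the ones the selection step would drop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data: with shuffle=False the first 3 columns carry signal,
# the remaining 5 are noise.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=1.0, shuffle=False, random_state=0)

# L1 penalty shrinks uninformative coefficients to exactly zero.
model = Lasso(alpha=1.0).fit(X, y)
kept = np.where(model.coef_ != 0)[0]
print("features kept by L1 regularization:", kept)
```

In H2O's actual GLM this corresponds to setting the elastic-net mixing parameter toward the L1 end, so the same "feature_selection" switch could map to different mechanisms per algorithm.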