h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add Feature Selection function #12136

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Add a feature selection function to H2O that removes unhelpful or harmful features.

One method of performing feature selection is Boruta (described here: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/). It differs from simply dropping the x variables with the lowest variable importance, because it can also remove variables with high variable importance that are hurting the model.

For example, high cardinality categorical columns often have very high variable importance in algorithms like GBM and Random Forest but they can actually hurt model performance. Boruta should be able to identify these types of variables as well as variables with little to no impact on the model.
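To make the idea concrete, here is a minimal sketch of the shadow-feature comparison at the core of Boruta. It is not Boruta itself and not an H2O API: absolute Pearson correlation is swapped in for the random-forest importance the real method uses, the statistical test is replaced by a plain hit-rate cutoff, and the names `abs_pearson`, `shadow_select`, `rounds` and `min_hit_rate` are hypothetical.

```python
import random
import statistics


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; a stand-in for a tree-based importance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return abs(cov / (sx * sy))


def shadow_select(columns, target, rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled ("shadow") copy in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        # Shadows are shuffled copies: same distribution, no link to the target.
        best_shadow = 0.0
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_pearson(shadow, target))
        # A real feature scores a hit when it outperforms the best shadow.
        for name, values in columns.items():
            if abs_pearson(values, target) > best_shadow:
                hits[name] += 1
    return [name for name in columns if hits[name] / rounds >= min_hit_rate]


# Toy data (assumed for illustration): "signal" drives the target, "noise" does not.
data_rng = random.Random(42)
n = 300
signal = [data_rng.gauss(0, 1) for _ in range(n)]
noise = [data_rng.gauss(0, 1) for _ in range(n)]
target = [2.0 * s + data_rng.gauss(0, 0.5) for s in signal]

selected = shadow_select({"signal": signal, "noise": noise}, target)
print(selected)
```

Because the correlation stand-in only measures linear association, this sketch cannot detect the high-cardinality case described above; that would need a model-based importance such as the one GBM or Random Forest reports.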

It may make sense to add this as a parameter for each algorithm (for example: feature_selection = True/False), with the feature selection method specific to the algorithm. For GBM and Random Forest, the feature selection method could be Boruta. For GLM, the feature selection method could be Lasso regularization.
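For the GLM case, the reason Lasso works as a selector is that the L1 penalty drives weak coefficients to exactly zero. A rough pure-Python sketch of that effect, using coordinate descent with soft thresholding, might look like the following. This is an illustration only, not H2O's GLM solver; `soft_threshold`, `lasso_coordinate_descent`, `penalty` and the toy data are assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, target, penalty, sweeps=200):
    """Minimise (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 (no intercept)."""
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    residual = list(target)  # y - Xb, starting from b = 0
    col_sq = [sum(row[j] ** 2 for row in rows) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                row[j] * (res + row[j] * coefs[j])
                for row, res in zip(rows, residual)
            ) / n
            new_coef = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [res - row[j] * delta for row, res in zip(rows, residual)]
                coefs[j] = new_coef
    return coefs


# Toy design with orthogonal columns: target = 3 * strong + 0.2 * weak.
names = ["strong", "weak"]
rows = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
target = [3.2, -2.8, 2.8, -3.2]

coefs = lasso_coordinate_descent(rows, target, penalty=0.5)
kept = [name for name, coef in zip(names, coefs) if coef != 0.0]
print(kept)
```

With this penalty the weak column's coefficient is thresholded to zero, so only the strong column survives. A feature_selection option for GLM could read the surviving columns off the fitted coefficients in the same way.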

exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: Questions to consider:

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:3ae3c86a-e56a-4211-99d4-9a8cf5ab63f6] please add a link to the AutoML proposal that has different versions of FS algos

exalate-issue-sync[bot] commented 1 year ago

Nkululeko Thangelane commented: This would be a great feature to implement for H2O

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5264
Assignee: UNASSIGNED
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A