Open exalate-issue-sync[bot] opened 1 year ago
Megan Kurka commented: Questions to consider:
Michal Kurka commented: [~accountid:557058:3ae3c86a-e56a-4211-99d4-9a8cf5ab63f6] please add a link to the AutoML proposal that has different versions of FS algos
Nkululeko Thangelane commented: This would be a great feature to implement for H2O
JIRA Issue Migration Info
Jira Issue: PUBDEV-5264
Assignee: UNASSIGNED
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Add feature selection function in H2O that would remove any unhelpful or harmful features.
One method of performing feature selection is Boruta (described here: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/). This is different from simply removing the x variables with the lowest variable importance, because Boruta can also remove variables with high variable importance that are hurting the model.
For example, high-cardinality categorical columns often have very high variable importance in algorithms like GBM and Random Forest, yet they can actually hurt model performance. Boruta should be able to identify these types of variables as well as variables with little to no impact on the model.
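To make the Boruta idea concrete, here is a minimal sketch (not H2O code; it uses scikit-learn's `RandomForestClassifier` as a stand-in model, and the dataset, thresholds, and round count are illustrative assumptions): each real feature gets a shuffled "shadow" copy, a forest is fit on both, and a real feature survives only if it repeatedly beats the strongest shadow's importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Toy data: with shuffle=False the first 3 columns are informative,
# the remaining 3 are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=42)

def boruta_round(X, y, rng):
    # Shadow features: column-wise permutations of the real features,
    # so each shadow keeps its marginal distribution but loses any
    # relationship to the target.
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    augmented = np.hstack([X, shadows])
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(augmented, y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_max = rf.feature_importances_[n:].max()
    # A real feature "hits" when it outperforms the strongest shadow.
    return real_imp > shadow_max

# Repeat several rounds; keep features that hit in a majority of them.
hits = sum(boruta_round(X, y, rng) for _ in range(10))
selected = np.where(hits >= 6)[0]
print("selected feature indices:", selected)
```

Because the shadows are provably uninformative, this test can reject a feature with high raw importance (e.g. a high-cardinality categorical that mostly memorizes noise) if it cannot consistently beat its own shuffled copies.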
It may make sense to add this as a parameter for each algorithm (for example: feature_selection = True/False), with the feature selection method specific to the algorithm. For GBM and Random Forest, the feature selection method could be Boruta. For GLM, the feature selection method could be Lasso regularization.
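For the GLM case, a hedged sketch of the Lasso route (again using scikit-learn rather than H2O's GLM, with an illustrative dataset and an arbitrary alpha): L1 regularization drives some coefficients exactly to zero, and the zeroed features are the ones the selection step would drop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data: with shuffle=False the first 3 columns carry signal,
# the remaining 5 are noise.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=1.0, shuffle=False, random_state=0)

# L1 penalty shrinks uninformative coefficients to exactly zero.
model = Lasso(alpha=1.0).fit(X, y)
kept = np.where(model.coef_ != 0)[0]
print("features kept by L1 regularization:", kept)
```

In H2O's actual GLM this corresponds to setting the elastic-net mixing parameter toward the L1 end, so the same "feature_selection" switch could map to different mechanisms per algorithm.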