h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Auto-detect unique ID columns and remove from predictors #7555

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We can auto-detect unique ID columns (#unique values == #observations) in H2OFrames and exclude them from the set of predictor columns, with a warning that they have been removed.

Motivation: ID columns are not useful for prediction, and when they are encoded as a factor, they will cause all sorts of performance issues since they will be super high-cardinality columns. If the ID column is numeric, there's not as much of a performance issue, but it's just a useless column.
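A minimal sketch of the detection rule described above (#unique values == #observations), written in plain Python rather than against the H2O API. The column-dict layout and the function names `find_unique_id_columns` and `drop_unique_id_predictors` are hypothetical and only for illustration; this is not how H2O implements it. Note that the rule as stated would also flag a continuous numeric column in which every value happens to be distinct, so a real implementation would probably need an extra check on column type.

```python
import warnings


def find_unique_id_columns(columns):
    """Return names of columns whose distinct-value count equals the row count.

    `columns` is assumed to be a dict mapping column name -> list of values.
    """
    id_cols = []
    for name, values in columns.items():
        # Skip empty columns; otherwise compare #unique values to #observations.
        if len(values) > 0 and len(set(values)) == len(values):
            id_cols.append(name)
    return id_cols


def drop_unique_id_predictors(columns, predictors):
    """Remove detected ID columns from the predictor list and warn about it."""
    id_cols = set(find_unique_id_columns(columns))
    removed = [p for p in predictors if p in id_cols]
    if removed:
        warnings.warn(f"Removed unique ID columns from predictors: {removed}")
    return [p for p in predictors if p not in id_cols]


# Hypothetical example data: customer_id is unique per row, the others are not.
data = {
    "customer_id": ["a1", "a2", "a3", "a4"],
    "age": [31, 45, 31, 52],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
}
predictors = drop_unique_id_predictors(data, ["customer_id", "age", "city"])
```

Here `customer_id` is dropped with a warning, while `age` and `city` stay in the predictor list because they contain repeated values.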

Suggestion by [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] for AutoML: at the AutoML level I think we can go further and do some gentle preprocessing, e.g. if #unique values < #observations && #unique values == #non-NA values, substitute the column with a “value.x.defined” column of yes/no values. This preserves the information about whether a value is NA or not (e.g. a missing social security number might be a good feature for an algo).
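The AutoML-level suggestion above could be sketched roughly as follows, again in plain Python. It assumes that NA is represented as `None` and that "#unique values" counts distinct non-NA values; the function name `to_defined_indicator` is hypothetical, and nothing here reflects an actual H2O or AutoML implementation.

```python
def to_defined_indicator(values):
    """Return a yes/no "is defined" column if the rule applies, else None.

    Rule from the suggestion: #unique < #observations and #unique == #non-NA,
    i.e. every present value is distinct but some values are missing.
    """
    non_na = [v for v in values if v is not None]
    n_unique = len(set(non_na))
    if n_unique < len(values) and n_unique == len(non_na):
        return ["yes" if v is not None else "no" for v in values]
    return None  # rule does not apply; leave the column as it is


# Hypothetical column: an ID-like field with some missing entries.
ssn = ["111", None, "222", "333", None]
ssn_defined = to_defined_indicator(ssn)
```

For the `ssn` column every present value is distinct and two entries are missing, so the result is a yes/no indicator of whether the value was defined. A column with repeated values, or a fully populated unique ID column, is left untouched by this rule.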

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8094
Assignee: New H2O Bugs
Reporter: Erin LeDell
State: Open
Fix Version: Backlog
Attachments: N/A
Development PRs: Available

h2o-ops commented 1 year ago

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/6422