Auto-detect unique ID columns and remove from predictors

h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Apache License 2.0


We can auto-detect unique ID columns (#unique values == #observations) in H2OFrames and exclude them from the set of predictor columns, with a warning that they have been removed.
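A minimal sketch of the detection rule, in plain Python rather than against the H2O API. It assumes a frame is represented as a dict mapping column names to lists of values; `find_id_columns` and `drop_id_predictors` are hypothetical helper names for illustration, not existing H2O functions.

```python
import warnings


def find_id_columns(columns):
    """Return names of columns where #unique values == #observations."""
    id_cols = []
    for name, values in columns.items():
        n_rows = len(values)
        # Rule from the issue: every row holds a distinct value.
        if n_rows > 0 and len(set(values)) == n_rows:
            id_cols.append(name)
    return id_cols


def drop_id_predictors(columns, predictors):
    """Remove detected ID columns from the predictor list and warn about it."""
    id_cols = set(find_id_columns(columns))
    removed = [p for p in predictors if p in id_cols]
    if removed:
        warnings.warn(f"Removed unique ID columns from predictors: {removed}")
    return [p for p in predictors if p not in id_cols]


# Hypothetical example data: customer_id is unique per row, the others are not.
id_demo = {
    "customer_id": ["a1", "a2", "a3", "a4"],
    "age": [31, 45, 31, 52],
    "plan": ["basic", "pro", "basic", "pro"],
}
id_demo_predictors = drop_id_predictors(id_demo, ["customer_id", "age", "plan"])
```

Note that, taken literally, the rule would also flag a real-valued column that happens to have no repeated values, so an actual implementation might restrict the check to factor, string, or integer columns.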

Motivation: ID columns are not useful for prediction. When they are encoded as a factor, they become extremely high-cardinality columns and cause a range of performance problems. A numeric ID column is less of a performance issue, but it is still a useless predictor.

Suggestion by [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] for AutoML: at the AutoML level I think we can go further and do some gentle preprocessing. For example, if #unique values < #observations && #unique values == #non-NA values, replace the column with a “value.x.defined” column holding yes/no values. This preserves the information about whether a value is missing (e.g. a missing social security number might be a good feature for an algorithm).
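The suggested preprocessing could look roughly like the sketch below, again in plain Python with a dict of lists standing in for a frame. It assumes `None` marks an NA, that #unique values counts distinct non-NA values, and that the "x" in “value.x.defined” stands for the original column name; `is_sparse_id` and `replace_with_defined_indicator` are hypothetical names.

```python
def is_sparse_id(values):
    """True if #unique < #observations and #unique == #non-NA values."""
    non_na = [v for v in values if v is not None]
    n_unique = len(set(non_na))
    # Taken literally, an all-NA column also satisfies this rule.
    return n_unique < len(values) and n_unique == len(non_na)


def replace_with_defined_indicator(columns):
    """Swap each ID-like column that has NAs for a yes/no 'defined' column."""
    out = {}
    for name, values in columns.items():
        if is_sparse_id(values):
            # Keep only the information about whether a value was present.
            out[f"value.{name}.defined"] = [
                "no" if v is None else "yes" for v in values
            ]
        else:
            out[name] = values
    return out


# Hypothetical example data: ssn is unique where present but missing in two rows.
na_demo = {
    "ssn": ["111", None, "333", None],
    "age": [31, 45, 31, 52],
}
na_demo_processed = replace_with_defined_indicator(na_demo)
```

In this example `ssn` is replaced by `value.ssn.defined` with the values yes, no, yes, no, while `age` is left untouched because it contains a repeated value.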


Auto-detect unique ID columns and remove from predictors #7555