Remove categorical limit in H2OFrame

h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Apache License 2.0

When importing a file that has high-cardinality columns, I get the following error:

Error: DistributedException from /172.16.2.196:55888: 'Exceeded categorical limit on column #3 (using 1-based indexing).  Consider reparsing this column as a string.', caused by water.parser.ParseDataset$H2OParseException: Exceeded categorical limit on column #3 (using 1-based indexing).  Consider reparsing this column as a string.
Execution halted

This happens with a recent version of h2o.

I understand that the error occurs because the column is parsed as type enum, and that the message suggests parsing it as a string instead. Still, we could remove the categorical limit and allow high-cardinality columns to be stored as enum, rather than forcing users to fall back to string. Using enum instead of string is likely to speed up operations like h2o.merge, even for a high-cardinality enum.
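To illustrate why I expect enum keys to help a merge, here is a minimal plain-Python sketch of dictionary encoding. It is not H2O's implementation: the function names (`encode_enum`, `merge_on_codes`) and the sample data are hypothetical, and it assumes both key columns share one domain. The idea is that each string level is mapped to an integer code once, so the join itself only hashes and compares integers.

```python
from typing import Dict, List, Optional


def encode_enum(values: List[str], domain: List[str]) -> List[int]:
    """Map each string to its integer position in a shared domain."""
    index: Dict[str, int] = {level: i for i, level in enumerate(domain)}
    return [index[v] for v in values]


def merge_on_codes(
    left_codes: List[int], right_codes: List[int], right_vals: List[int]
) -> List[Optional[int]]:
    """Left join on integer codes; unmatched left rows get None."""
    lookup = dict(zip(right_codes, right_vals))
    return [lookup.get(code) for code in left_codes]


# Hypothetical key columns from two frames to be merged.
left_keys = ["u3", "u1", "u2", "u1"]
right_keys = ["u1", "u2", "u4"]
right_vals = [10, 20, 40]

# Shared domain: the sorted set of all levels seen in either column.
domain = sorted(set(left_keys) | set(right_keys))

left_codes = encode_enum(left_keys, domain)
right_codes = encode_enum(right_keys, domain)

# The join now works on small integers instead of strings.
joined = merge_on_codes(left_codes, right_codes, right_vals)
```

With a high-cardinality column the domain is large, but it is stored once, while every row carries only an integer code. That is the reason I would prefer enum over string for merge keys if the limit were lifted.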

Remove categorical limit in H2OFrame #8105