h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Remove categorical limit in H2OFrame #8105

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

When importing a file that has high-cardinality columns, I get the following error:

Error: DistributedException from /172.16.2.196:55888: 'Exceeded categorical limit on column #3 (using 1-based indexing).  Consider reparsing this column as a string.', caused by water.parser.ParseDataset$H2OParseException: Exceeded categorical limit on column #3 (using 1-based indexing).  Consider reparsing this column as a string.
Execution halted

This occurs with a recent version of h2o.

I understand that the error occurs because the column is parsed as type enum, and that the message suggests reparsing it as a string. Still, we could remove the categorical limit and allow high-cardinality columns to be stored as enum, rather than forcing users to fall back to string. Using enum instead of string is likely to speed up operations like h2o.merge, even for a high-cardinality enum.
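To illustrate why an enum key might be cheaper to merge on than a string key, here is a minimal sketch in plain Python. It is not H2O's implementation, and the function names `encode_as_enum` and `merge_on_codes` are hypothetical. It assumes an enum column is stored as integer codes plus a domain of unique levels, so strings are compared once per level and the per-row work uses integers only.

```python
from typing import Dict, List, Tuple


def encode_as_enum(values: List[str]) -> Tuple[List[int], List[str]]:
    """Dictionary-encode strings into integer codes plus a domain of unique levels."""
    domain: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(domain)
            domain.append(v)
        codes.append(index[v])
    return codes, domain


def merge_on_codes(
    left_codes: List[int],
    left_domain: List[str],
    right_codes: List[int],
    right_domain: List[str],
) -> List[Tuple[int, int]]:
    """Inner-join two encoded columns, returning (left_row, right_row) pairs."""
    # Strings are compared once per level here, not once per row.
    left_index = {level: code for code, level in enumerate(left_domain)}
    remap = [left_index.get(level, -1) for level in right_domain]

    # Bucket left rows by their integer code.
    rows_by_code: Dict[int, List[int]] = {}
    for i, c in enumerate(left_codes):
        rows_by_code.setdefault(c, []).append(i)

    # Per-row matching now touches integers only; -1 marks a level absent on the left.
    pairs: List[Tuple[int, int]] = []
    for j, c in enumerate(right_codes):
        for i in rows_by_code.get(remap[c], []):
            pairs.append((i, j))
    return pairs


left = ["u1", "u2", "u3", "u1"]
right = ["u3", "u1", "u9"]
lc, ld = encode_as_enum(left)
rc, rd = encode_as_enum(right)
pairs = merge_on_codes(lc, ld, rc, rd)
```

With a high-cardinality key the domain itself becomes large, so the actual benefit would depend on how the domain is stored and distributed across the cluster; the sketch only shows the per-row saving.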

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7533
Assignee: New H2O Bugs
Reporter: Jan Gorecki
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A