h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

CV splits fail when applied on a fold column with cardinality lower than its domain length #8153

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

h2. Issue

See the attached notebook to reproduce the issue, along with additional unexpected behaviour when counting the unique values of a subframe. The notebook uses AutoML, but the error is not specific to AutoML.

Also note that the error message is confusing:

{quote}{{water.exceptions.H2OIllegalArgumentException: Not enough data to create 5 random cross-validation splits. Either reduce nfolds, specify a larger dataset (or specify another random number seed, if applicable).}}{quote}

The message should probably be different when {{fold_column}} is used.
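To make the mismatch in the title concrete, here is a toy model in plain Python. It is not H2O's internal representation or its CV code. It only assumes that a categorical column is stored as integer codes plus a domain, and that fold sizing is driven by the domain length rather than by the levels actually in use. All names are hypothetical.

```python
from collections import Counter

# Toy categorical column: integer codes that index into a domain of levels.
# Illustration only; this is not how H2O stores or splits data.
domain = ["f0", "f1", "f2", "f3", "f4"]
codes = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# Filter rows the way a subframe selection would: keep levels f0, f1 and f2.
keep = {"f0", "f1", "f2"}
sub_codes = [c for c in codes if domain[c] in keep]

# The subset still carries the full domain, so the two counts disagree.
domain_length = len(domain)        # number of declared levels
cardinality = len(set(sub_codes))  # number of levels actually present

# Assumption: one fold per domain level. Levels with no rows give empty folds.
rows_per_fold = Counter(sub_codes)
empty_folds = [domain[i] for i in range(domain_length) if rows_per_fold[i] == 0]

print(domain_length, cardinality, empty_folds)
```

Under that assumption, the filtered column reports five levels but only three contain rows, which would be consistent with a "not enough data" style failure even though every remaining row has a valid fold.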

h2. Suggestions

{code:python}new_fr = old[old['col'] == 'foo'].relevel(){code}

or, pandas style (similar to {{.loc}}):

{code:python}new_fr = old.sub[old['col'] == 'foo']{code}
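For discussion, a minimal sketch of what "re-levelling after a filter" could mean, written in plain Python on a codes-plus-domain toy model. The helper name {{drop_unused_levels}} is hypothetical and this is not an existing H2O call. It only shows the intended effect: the domain shrinks to the levels still present and the codes are remapped accordingly.

```python
def drop_unused_levels(codes, domain):
    """Return (new_codes, new_domain) keeping only levels that occur in codes.

    Hypothetical helper for illustration; not part of the H2O API.
    """
    used = sorted(set(codes))                       # old codes still in use
    remap = {old: new for new, old in enumerate(used)}
    new_domain = [domain[old] for old in used]      # surviving levels, original order
    new_codes = [remap[c] for c in codes]           # codes re-indexed into new_domain
    return new_codes, new_domain


# Usage: a filtered column that only uses levels "a", "c" and "e".
new_codes, new_domain = drop_unused_levels([0, 2, 2, 4], ["a", "b", "c", "d", "e"])
print(new_codes, new_domain)
```

After such a step, cardinality and domain length agree, so anything sized from the domain would see only the folds that actually have rows.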

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7485
Assignee: New H2O Bugs
Reporter: Sebastien Poirier
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: PUBDEV-7485.ipynb
Attached By: Sebastien Poirier
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7485/PUBDEV-7485.ipynb