h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O ignores column types when importing files #12745

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Hi,

We ran into a problem where a model trained with AutoML fails at prediction time, reporting that a column is real-valued in the training frame but is an enum in the frame being scored:

{code} OSError: Job with key $03010a00020f32d4ffffffff$_829837b2954aec762457cf0b77e1428a failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'FOOBAR' which is real-valued in the training data {code}

All our h2o.import_file calls set column types explicitly, so this was surprising. On closer inspection, it seems that H2O forces any empty column into something like "int". I say "something like" because when I try to change the column type to "enum" in H2O Flow, I get an error that calls the type {{BAD}}:

{code} Error evaluating cell Error calling POST /99/Rapids with opts {"ast":"(assign temp.hex (:= temp.hex (...

ERROR MESSAGE: asfactor() requires a string, categorical, or numeric column. Received BAD. Please convert column to a string or categorical first. {code}

If, following the suggestion in the error, I try to change the type to 'int', 'real' or 'string', I receive this error:

{code} Error evaluating cell Error calling POST /99/Rapids with opts {"ast":"(assign temp.hex (:= temp.hex (...

ERROR MESSAGE: Unrecognized column type BAD given to toNumericVec() {code}

For some context on why we have empty columns in the first place: we have some orchestration code written around H2O that handles loading data, training, scoring, etc. We integration-test this code against an H2O instance running on a development machine, with datasets severely reduced in size (e.g. 100 rows).

The H2O version in question is 3.21.0.4345, because I believe the [fix for this issue|https://0xdata.atlassian.net/browse/PUBDEV-5663] has not reached the mainline.

I will find a workaround for the time being, so it is not very urgent, but I think H2O silently ignoring user-set preferences without a warning of any kind is a bug. Ideally, it would use the correct type even if the column is empty.
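One possible workaround, sketched here under assumptions rather than taken from the report, is to check a file for entirely empty columns before handing it to H2O, so the orchestration code can warn or fail early instead of discovering the type mismatch at scoring time. The helper name `find_empty_columns` and the default NA markers are hypothetical; only the Python standard library is used.

```python
import csv

def find_empty_columns(csv_path, na_strings=("", "NA")):
    """Return the names of columns whose every data cell is empty or an NA marker.

    Hypothetical helper: the name and the default NA markers are assumptions,
    not part of H2O. A file with a header but no data rows reports every
    column as empty.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Start by assuming every column is empty, then drop a column
        # as soon as one real value is seen in it.
        candidates = set(reader.fieldnames or [])
        for row in reader:
            for name in list(candidates):
                value = row.get(name)
                if value is not None and value.strip() not in na_strings:
                    candidates.discard(name)
            if not candidates:
                break  # every column has at least one value; stop scanning
    return sorted(candidates)
```

The result can be compared against the column types the caller intends to pass to h2o.import_file: any column that is requested as "enum" but appears in the returned list is a candidate for the behaviour described above.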

I attached a small script to reproduce the error.
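The attached script is not reproduced here. The following is only a minimal sketch of the kind of reproduction described above, assuming a local H2O instance can be started with `h2o.init()`. The file name, the column names other than FOOBAR, and both function names are made up for illustration, and the sketch makes no claim about what any particular H2O version reports.

```python
import csv

def write_sample_csv(path, n_rows=5):
    """Write a small CSV in which the FOOBAR column has no values at all."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "FOOBAR"])
        for i in range(n_rows):
            writer.writerow([i, ""])  # FOOBAR is left empty on every row
    return path

def imported_types(path, col_types):
    """Import the file with explicit column types and return the types H2O reports."""
    import h2o  # imported lazily: needs the h2o package and a Java runtime
    h2o.init()
    frame = h2o.import_file(path, col_types=col_types)
    return frame.types  # dict mapping column name to H2O's reported type

# Example usage (needs a working H2O installation, so it is left commented out):
# sample = write_sample_csv("empty_column.csv")
# requested = {"ID": "numeric", "FOOBAR": "enum"}
# print("requested:", requested)
# print("reported: ", imported_types(sample, requested))
```

Comparing the requested types with the reported ones shows whether the explicit "enum" setting survived the import for the empty column.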

Let me know if I can help with anything else.

Thank you!

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5893
Assignee: New H2O Bugs
Reporter: Julius Šėporaitis
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: h2o_column_types.py
Attached By: Julius Šėporaitis
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5893/h2o_column_types.py