Open exalate-issue-sync[bot] opened 1 year ago
JIRA Issue Migration Info
Jira Issue: PUBDEV-5893
Assignee: New H2O Bugs
Reporter: Julius Šėporaitis
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A
Attachments From Jira
Attachment Name: h2o_column_types.py
Attached By: Julius Šėporaitis
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5893/h2o_column_types.py
Hi,
We ran into a problem where an AutoML-trained model fails at prediction time, reporting that a column was real-valued in the training frame whereas the frame to score has that column as enum:
{code} OSError: Job with key $03010a00020f32d4ffffffff$_829837b2954aec762457cf0b77e1428a failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'FOOBAR' which is real-valued in the training data {code}
All our {{h2o.import_file}} calls set column types explicitly, so this was surprising. On closer inspection, it seems that H2O forces any empty column into something like "int". I say "something like" because, if I try to change the column type to "enum" in H2O Flow, I get an error that calls the type {{BAD}}:
{code} Error evaluating cell Error calling POST /99/Rapids with opts {"ast":"(assign temp.hex (:= temp.hex (...
ERROR MESSAGE: asfactor() requires a string, categorical, or numeric column. Received BAD. Please convert column to a string or categorical first. {code}
If, following the suggestion in the error, I try to change the type to 'int', 'real' or 'string', I receive this error:
{code} Error evaluating cell Error calling POST /99/Rapids with opts {"ast":"(assign temp.hex (:= temp.hex (...
ERROR MESSAGE: Unrecognized column type BAD given to toNumericVec() {code}
For some context on why we have empty columns in the first place: we have some orchestration code written around H2O that handles loading data, training, scoring, etc. We integration-test this code by running it against H2O on a development machine with datasets severely reduced in size (e.g. 100 rows), and with so few rows some columns end up entirely empty.
The H2O version in question is 3.21.0.4345, because I believe the [fix for this issue|https://0xdata.atlassian.net/browse/PUBDEV-5663] has not reached the mainline.
I will find a workaround for the time being, so this is not very urgent, but I think H2O silently ignoring user-set preferences without a warning of any kind is a bug. Ideally, it would use the correct type even if the column is empty.
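One possible shape for such a workaround is a check that compares the requested column types with the types the parsed frame reports, and fails loudly on any difference. This is only a minimal sketch: the helper name {{find_type_mismatches}} is hypothetical, the plain dictionaries stand in for the {{col_types}} argument and the frame's reported types, and the sample values are illustrative rather than taken from a real run. It also assumes both sides use the same type names; aliases such as "numeric" would need normalizing first.

```python
def find_type_mismatches(requested, actual):
    """Return {column: (requested_type, actual_type)} for columns whose types differ.

    Both arguments are plain dicts mapping column name -> type name.
    A column missing from `actual` is reported with an actual type of None.
    """
    mismatches = {}
    for column, wanted in requested.items():
        got = actual.get(column)
        if got != wanted:
            mismatches[column] = (wanted, got)
    return mismatches


# Illustrative values only: what we asked for vs. what the frame reports.
requested_types = {"FOOBAR": "enum", "AMOUNT": "real"}
reported_types = {"FOOBAR": "int", "AMOUNT": "real"}

mismatches = find_type_mismatches(requested_types, reported_types)
if mismatches:
    # In orchestration code this could raise instead of printing.
    print("Column types differ from the requested ones:", mismatches)
```

Running a check like this right after import, rather than at scoring time, would surface the silently changed column before a model is trained on it.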
I attached a small script to reproduce the error.
Let me know if I can help with anything else.
Thank you!