Closed exalate-issue-sync[bot] closed 1 year ago
Nidhi Mehta commented: #94470 (https://support.h2o.ai/a/tickets/94470) - H2O: factor datatypes
Jan Sterba commented: fixed vec.ascharacter() and vec.asfactor() to work on missing value columns
JIRA Issue Migration Info
Jira Issue: PUBDEV-6534
Assignee: Jan Sterba
Reporter: Lauren DiPerna
State: Resolved
Fix Version: 3.24.0.5
Attachments: Available (Count: 2)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/3562
Attachments From Jira
Attachment Name: test1.csv Attached By: Lauren DiPerna File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6534/test1.csv
Attachment Name: test2.csv Attached By: Lauren DiPerna File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6534/test2.csv
We should allow users to rbind two frames even if one frame contains only missing values in some of its columns. Currently, you have to fill the all-missing-value columns before performing the rbind.
Rbind requires that columns have the same type, but you can't change the type of an all-missing-value column without breaking the frame.
Here is a reproducible example. The first part shows how changing the type of an all-missing-value column breaks the frame; the second part shows a workaround.
{code}
import h2o
h2o.init()

df1 = h2o.import_file('rbind_datasets/test1.csv', col_types=['factor']*5)
df2 = h2o.import_file('rbind_datasets/test2.csv', col_types=['factor']*5, header=-1)
for x in df1.columns:
    print(x, df1.type(x), df2.type(x))

# this will break the df2 frame because you apply ascharacter or asfactor to a missing value column
print(df2.type('C3'), df2.type('C4'))
df2['C3'] = df2['C3'].ascharacter()
df2['C4'] = df2['C4'].ascharacter()
print(df2.type('C3'), df2.type('C4'))

# below stacktrace will appear if you try and access df2
print(df2)

# H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
#   Error: Unrecognized column type BAD given to toStringVec().
#   Request: POST /99/Rapids
#     data: {'ast': "(tmp= py_16_sid_9da5 (:= (tmp= py_15_sid_9da5 (:= test27.hex (as.character (cols_py test27.hex 'C3')) 2 [])) (as.character (cols_py py_15_sid_9da5 'C4')) 3 []))", 'session_id': '_sid_9da5'}

# workaround is to fill the missing values first
df1 = h2o.import_file('rbind_datasets/test1.csv', col_types=['factor']*5)
df2 = h2o.import_file('rbind_datasets/test2.csv', col_types=['factor']*5, header=-1)

df2['C4'] = 99999999
df2['C3'] = 99999999

df2['C3'] = df2['C3'].asfactor()
df2['C4'] = df2['C4'].asfactor()

df2.rbind(df1)
{code}
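Since rbind requires matching column types, it may help to list the mismatched columns before deciding which ones need the workaround. The following is only a rough sketch: `find_type_mismatches` is a hypothetical helper, not part of h2o, and the type dictionaries are made-up stand-ins for what `df1.types` and `df2.types` might return.

```python
def find_type_mismatches(types_a, types_b):
    """Return {column: (type_a, type_b)} for shared columns whose types differ."""
    return {
        col: (types_a[col], types_b[col])
        for col in types_a
        if col in types_b and types_a[col] != types_b[col]
    }


# Hypothetical type dictionaries standing in for df1.types and df2.types;
# the actual type reported for an all-missing-value column may differ.
types_1 = {"C1": "enum", "C2": "enum", "C3": "enum"}
types_2 = {"C1": "enum", "C2": "enum", "C3": "int"}

mismatches = find_type_mismatches(types_1, types_2)
print(mismatches)
```

With real frames, the call would presumably be `find_type_mismatches(df1.types, df2.types)`, and any column it reports would need its type aligned before `df2.rbind(df1)`.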