h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Allow rbind on missing value columns #9096

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We should allow users to rbind two frames even if one frame contains only missing values in some of its columns. Currently, you have to fill all-missing-value columns before performing the rbind.

Rbind requires that columns have the same type, but you can't change the type of an all-missing-value column without breaking the frame.
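To see which columns would block an rbind, the column types of the two frames can be compared up front. The sketch below is plain Python and assumes each frame's types are available as a name-to-type dictionary (an H2OFrame exposes one as `types`); `rbind_type_mismatches` is a hypothetical helper name, not part of the h2o API, and the sample dictionaries are made up for illustration.

```python
def rbind_type_mismatches(types_a, types_b):
    """Return {column: (type_a, type_b)} for columns whose types differ.

    Both arguments are dicts mapping column name -> type string,
    e.g. the value of `frame.types` on an H2OFrame (assumption).
    Columns are matched by name; names missing from types_b map to None.
    """
    mismatches = {}
    for name, type_a in types_a.items():
        type_b = types_b.get(name)
        if type_a != type_b:
            mismatches[name] = (type_a, type_b)
    return mismatches


# Illustrative type dictionaries (not taken from the attached CSV files).
left = {"C1": "enum", "C2": "enum", "C3": "enum"}
right = {"C1": "enum", "C2": "enum", "C3": "int"}
print(rbind_type_mismatches(left, right))
```

An empty result means the types already agree; any column listed would need converting before the rbind.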

Here is a reproducible example. The first part shows how changing the type of a missing-value column breaks the frame; the second part shows a workaround.

{code}
import h2o
h2o.init()

df1 = h2o.import_file('rbind_datasets/test1.csv', col_types=['factor']*5)
df2 = h2o.import_file('rbind_datasets/test2.csv', col_types=['factor']*5, header=-1)
for x in df1.columns:
    print(x, df1.type(x), df2.type(x))

# This breaks the df2 frame, because ascharacter() or asfactor() is applied to a missing-value column

print(df2.type('C3'), df2.type('C4'))
df2['C3'] = df2['C3'].ascharacter()
df2['C4'] = df2['C4'].ascharacter()
print(df2.type('C3'), df2.type('C4'))

# The stack trace below appears if you try to access df2

print(df2)

H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:

Error: Unrecognized column type BAD given to toStringVec().

Request: POST /99/Rapids

data: {'ast': "(tmp= py_16_sid_9da5 (:= (tmp= py_15_sid_9da5 (:= test27.hex (as.character (cols_py test27.hex 'C3')) 2 [])) (as.character (cols_py py_15_sid_9da5 'C4')) 3 []))", 'session_id': '_sid_9da5'}

# Workaround: fill the missing values first

df1 = h2o.import_file('rbind_datasets/test1.csv', col_types=['factor']*5)
df2 = h2o.import_file('rbind_datasets/test2.csv', col_types=['factor']*5, header=-1)

df2['C4'] = 99999999
df2['C3'] = 99999999

df2['C3'] = df2['C3'].asfactor()
df2['C4'] = df2['C4'].asfactor()

df2.rbind(df1)
{code}
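The workaround above hard-codes the columns `C3` and `C4`. A minimal sketch of how the all-missing columns could be found instead, assuming the column names, per-column NA counts, and row count are passed in (for an H2OFrame these would presumably come from `columns`, `nacnt()`, and `nrow`). `all_missing_columns` is a hypothetical helper, and the sample counts are invented for illustration.

```python
def all_missing_columns(names, na_counts, nrow):
    """Return the names of columns in which every row is missing.

    names     -- list of column names
    na_counts -- list of NA counts, one per column, in the same order
    nrow      -- number of rows in the frame
    """
    if len(names) != len(na_counts):
        raise ValueError("names and na_counts must have the same length")
    if nrow == 0:
        # An empty frame has no rows to be missing.
        return []
    return [name for name, count in zip(names, na_counts) if count == nrow]


# Illustrative values (not taken from the attached CSV files).
names = ["C1", "C2", "C3", "C4", "C5"]
na_counts = [0, 1, 4, 4, 0]
print(all_missing_columns(names, na_counts, nrow=4))
```

Each returned column could then be filled with a placeholder value and converted with `asfactor()`, as in the workaround.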

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #94470 (https://support.h2o.ai/a/tickets/94470) - H2O: factor datatypes

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: fixed vec.ascharacter() and vec.asfactor() to work on missing value columns

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6534
Assignee: Jan Sterba
Reporter: Lauren DiPerna
State: Resolved
Fix Version: 3.24.0.5
Attachments: Available (Count: 2)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/3562

Attachments From Jira

Attachment Name: test1.csv
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6534/test1.csv

Attachment Name: test2.csv
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6534/test2.csv