h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Python H2OFrame inconsistent behavior on merge caused by column ordering #8991

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

*H2O 3.24.0.4, Python 3.6.8, [https://stackoverflow.com/questions/56584254/python-h2oframe-inconsistent-behavior-on-merge-caused-by-column-ordering]*

I have the following CSV files that I'm importing into H2OFrames.

CSV 1(a):

{code:java}
year,manufacturer,model,salePrice
2010,HONDA,CIVIC,100
2011,TOYOTA,CAMRY,150
2010,HONDA,CIVIC,50
2011,TOYOTA,CAMRY,200
2010,HONDA,CIVIC,150
2011,TOYOTA,CAMRY,250
2012,SUZUKI,SWIFT,500
2012,SUZUKI,SWIFT,600
{code}

CSV 1(b):

{code:java}
manufacturer,model,year,salePrice
HONDA,CIVIC,2010,100
TOYOTA,CAMRY,2011,150
HONDA,CIVIC,2010,50
TOYOTA,CAMRY,2011,200
HONDA,CIVIC,2010,150
TOYOTA,CAMRY,2011,250
SUZUKI,SWIFT,2012,500
SUZUKI,SWIFT,2012,600
{code}

CSV 2:

{code:java}
year,manufacturer,model,bodyType
2010,HONDA,CIVIC,SEDAN
2011,TOYOTA,CAMRY,SEDAN
2012,SUZUKI,SWIFT,HATCHBACK
{code}

Notice that "1a" and "1b" are exactly the same data with just the columns ordered differently. Now I import all three of them into H2OFrame as follows:

{code:python}
import h2o

h2o.init()
df1a = h2o.import_file('csv1a.csv')
df1b = h2o.import_file('csv1b.csv')
df2 = h2o.import_file('csv2.csv')
{code}

And then I try the following merge operations:

{code:python}
merge1 = df1a.merge(df2, by_x=['year', 'manufacturer', 'model'], by_y=['year', 'manufacturer', 'model'])
merge2 = df1b.merge(df2, by_x=['year', 'manufacturer', 'model'], by_y=['year', 'manufacturer', 'model'])
{code}

The first merge works as expected, but the second one fails with the error: "Merging columns must be the same type, column salePrice found types Enum and Numeric".

When I checked the logs on the H2O server, I found that the by_x column indices sent to the server were exactly the same for both merges (and sorted: [0 1 2]), despite the different positions of the by_x columns in df1b. From the server logs:

{code:java}parms={ast=(tmp= py_5_sid_abdb (merge h2o_bug_df1b1.hex h2o_bug_df21.hex False False [0 1 2] [0 1 2] 'auto')){code}
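To make the logged `[0 1 2] [0 1 2]` concrete, here is a small plain-Python sketch that maps those positions back to column names, using only the CSV headers shown above. It does not call H2O, and `pair_by_position` is a hypothetical helper written for illustration only.

```python
# Column order of each frame, copied from the CSV headers above.
df1a_cols = ["year", "manufacturer", "model", "salePrice"]
df1b_cols = ["manufacturer", "model", "year", "salePrice"]
df2_cols = ["year", "manufacturer", "model", "bodyType"]


def pair_by_position(left_cols, right_cols, by_x, by_y):
    """Hypothetical helper: list which column names a positional merge pairs up."""
    return [(left_cols[i], right_cols[j]) for i, j in zip(by_x, by_y)]


# Indices as they appear in the server log for both merges.
logged = [0, 1, 2]

pairs_1a = pair_by_position(df1a_cols, df2_cols, logged, logged)
pairs_1b = pair_by_position(df1b_cols, df2_cols, logged, logged)

print(pairs_1a)  # every key is paired with the same-named column
print(pairs_1b)  # keys are paired with differently named columns
```

With df1b, position 0 is `manufacturer`, so it ends up matched against `year` in df2. That is consistent with a type-mismatch failure, although it does not by itself explain why the message names salePrice.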

Then I found lines #1967 and #3422 in the code of the h2o.frame module: [https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L1967] [https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L3422]

For my second merge, the by_x column indices should have been [2 0 1]. Instead, the Python client sent [0 1 2] because of the {{list(set(tmp))}} statement on line #3422.
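The effect can be reproduced without H2O. The sketch below is not the code from frame.py; `resolve_key_indices` is a hypothetical helper that shows one order-preserving way to turn key names into positions, next to the set-based conversion for comparison.

```python
def resolve_key_indices(names, by):
    """Hypothetical helper: map key column names to positions, keeping caller order."""
    indices = [names.index(col) for col in by]
    # dict.fromkeys drops duplicates but keeps first-seen order
    # (guaranteed from Python 3.7; a CPython implementation detail in 3.6).
    return list(dict.fromkeys(indices))


df1b_cols = ["manufacturer", "model", "year", "salePrice"]
by_x = ["year", "manufacturer", "model"]

ordered = resolve_key_indices(df1b_cols, by_x)
via_set = list(set(df1b_cols.index(col) for col in by_x))

print(ordered)  # [2, 0, 1]
print(via_set)  # same elements, but a set makes no promise about their order
```

Because a set carries no ordering guarantee, whatever order `via_set` comes out in is incidental, whereas `ordered` follows the order the caller passed in `by_x`.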

Wouldn't it be better to fail fast and throw an error if the supplied by_x or by_y column names are not unique, instead of collecting them into a set, which does not preserve their order?
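A fail-fast check along those lines could look roughly like this. It is only a sketch of the suggestion, not existing H2O behavior; `check_unique_keys` and its error text are hypothetical.

```python
from collections import Counter


def check_unique_keys(by, arg_name):
    """Hypothetical validator: raise if a merge key is listed more than once."""
    duplicates = [col for col, count in Counter(by).items() if count > 1]
    if duplicates:
        raise ValueError(
            "{} contains duplicate column(s): {}".format(arg_name, duplicates)
        )
    return list(by)  # keys are returned in the order the caller supplied


# Unique keys pass through unchanged.
check_unique_keys(["year", "manufacturer", "model"], "by_x")

# A repeated key is rejected immediately instead of being silently deduplicated.
try:
    check_unique_keys(["year", "model", "year"], "by_x")
except ValueError as err:
    message = str(err)
    print(message)
```

Rejecting duplicates up front means no deduplication step is needed afterwards, so the caller's key order can be kept as given.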

Also, the error message was misleading: it refers to the salePrice column, which was in neither my by_x nor my by_y, so I only understood the failure after checking the code.

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6645
Assignee: New H2O Bugs
Reporter: Prashanth Govindaraj
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A