*H2O 3.24.0.4, Python 3.6.8, [https://stackoverflow.com/questions/56584254/python-h2oframe-inconsistent-behavior-on-merge-caused-by-column-ordering|https://stackoverflow.com/questions/56584254/python-h2oframe-inconsistent-behavior-on-merge-caused-by-column-ordering]*
I have the following CSV files that I'm importing into H2OFrames.
CSV 1(a):
{code:java}year,manufacturer,model,salePrice
2010,HONDA,CIVIC,100
2011,TOYOTA,CAMRY,150
2010,HONDA,CIVIC,50
2011,TOYOTA,CAMRY,200
2010,HONDA,CIVIC,150
2011,TOYOTA,CAMRY,250
2012,SUZUKI,SWIFT,500
2012,SUZUKI,SWIFT,600
{code}
CSV 1(b):
{code:java}manufacturer,model,year,salePrice
HONDA,CIVIC,2010,100
TOYOTA,CAMRY,2011,150
HONDA,CIVIC,2010,50
TOYOTA,CAMRY,2011,200
HONDA,CIVIC,2010,150
TOYOTA,CAMRY,2011,250
SUZUKI,SWIFT,2012,500
SUZUKI,SWIFT,2012,600
{code}
CSV 2:
{code:java}year,manufacturer,model,bodyType
2010,HONDA,CIVIC,SEDAN
2011,TOYOTA,CAMRY,SEDAN
2012,SUZUKI,SWIFT,HATCHBACK
{code}
Note that "1a" and "1b" contain exactly the same data; only the column order differs. I import all three files into H2OFrames as follows:
{code:python}import h2o

h2o.init()

df1a = h2o.import_file('csv1a.csv')
df1b = h2o.import_file('csv1b.csv')
df2 = h2o.import_file('csv2.csv')
{code}
And then I try the following merge operations:
{code:python}merge1 = df1a.merge(df2,
                    by_x=['year', 'manufacturer', 'model'],
                    by_y=['year', 'manufacturer', 'model'])
merge2 = df1b.merge(df2,
                    by_x=['year', 'manufacturer', 'model'],
                    by_y=['year', 'manufacturer', 'model'])
{code}
The first merge works as expected, but the second one fails with the error "Merging columns must be the same type, column salePrice found types Enum and Numeric".
When I checked the logs on the H2O server, I found that the by_x column indices sent to the server were exactly the same for both merges (and in ascending order, [0 1 2]), even though the by_x columns occur in a different order in df1b. From the server logs:
{code:java}parms={ast=(tmp= py_5_sid_abdb (merge h2o_bug_df1b1.hex h2o_bug_df21.hex False False [0 1 2] [0 1 2] 'auto')){code}
Then I found lines #1967 and #3422 in the code of the h2o.frame module: [https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L1967|https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L1967] [https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L3422|https://github.com/h2oai/h2o-3/blob/jenkins-rel-yates-4/h2o-py/h2o/frame.py#L3422]
For my second merge, the by_x column indices should have been [2 0 1]. Instead, the Python client sent [0 1 2] because of the {{list(set(tmp))}} statement on line #3422.
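To make the index mismatch concrete, here is a minimal sketch in plain Python (no H2O calls). The helper {{resolve_indices}} is a hypothetical name used only for illustration, not the function in {{frame.py}}; it maps each key name to its position in the frame while keeping the order in which the keys were given.

```python
def resolve_indices(frame_columns, by):
    """Map each name in `by` to its position in `frame_columns`,
    preserving the order of `by` (hypothetical helper, not h2o code)."""
    return [frame_columns.index(name) for name in by]


# Column layouts taken from the CSV files above.
cols_1a = ["year", "manufacturer", "model", "salePrice"]
cols_1b = ["manufacturer", "model", "year", "salePrice"]
cols_2 = ["year", "manufacturer", "model", "bodyType"]
by = ["year", "manufacturer", "model"]

idx_1a = resolve_indices(cols_1a, by)   # [0, 1, 2]
idx_1b = resolve_indices(cols_1b, by)   # [2, 0, 1]
idx_2 = resolve_indices(cols_2, by)     # [0, 1, 2]

# Passing the indices through a set discards the caller's ordering.
# Set iteration order is not guaranteed; the server log showed [0 1 2].
lossy_1b = list(set(idx_1b))

# If [0, 1, 2] is used for df1b, the key columns are paired positionally
# with the wrong columns of df2.
wrong_pairs = [(cols_1b[i], cols_2[j]) for i, j in zip([0, 1, 2], idx_2)]
right_pairs = [(cols_1b[i], cols_2[j]) for i, j in zip(idx_1b, idx_2)]

print(wrong_pairs)
print(right_pairs)
```

With the order-preserving indices, each key in df1b is paired with the column of the same name in df2; with [0 1 2], {{manufacturer}} is paired with {{year}}, {{model}} with {{manufacturer}}, and {{year}} with {{model}}.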
Wouldn't it be better to fail fast and throw an error if the supplied by_x or by_y column names are not unique, instead of collecting them into a set which obviously won't preserve the order?
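A fail-fast check along these lines could look like the sketch below. It is plain Python and only an assumption about how such validation might be written: the function name {{validate_merge_keys}} and its parameters are hypothetical and are not part of the h2o API.

```python
def validate_merge_keys(by, frame_columns, arg_name="by_x"):
    """Reject duplicate or unknown key names, then return the column
    indices in the same order as `by` (hypothetical helper)."""
    seen = set()
    duplicates = []
    for name in by:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates:
        raise ValueError(
            "%s contains duplicate column names: %s" % (arg_name, duplicates)
        )

    missing = [name for name in by if name not in frame_columns]
    if missing:
        raise ValueError(
            "%s refers to columns not in the frame: %s" % (arg_name, missing)
        )

    # A list comprehension keeps the caller's ordering, unlike list(set(...)).
    return [frame_columns.index(name) for name in by]


# Column layout of CSV 1(b) above.
columns_1b = ["manufacturer", "model", "year", "salePrice"]

indices = validate_merge_keys(["year", "manufacturer", "model"], columns_1b)
print(indices)

try:
    validate_merge_keys(["year", "year", "model"], columns_1b)
except ValueError as err:
    print(err)
```

Raising on duplicates makes the deduplicating set unnecessary, so the indices can be sent to the server in the order the caller supplied them.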
Also, the error message was misleading until I checked the code: it refers to the salePrice column, which is in neither my by_x nor my by_y.