Merge operation produces dataframe with rows order different from pandas

gshimansky commented 3 years ago

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

Ubuntu 19.10

Modin version (modin.__version__):

0.8.1.1+4.gbd8f73a

Python version:

Python 3.8.3

Code we can use to reproduce:

import os
import gc
import timeit
import modin.pandas as pd
#import pandas as pd

df1 = pd.read_csv("df1.csv", dtype={'id4': 'category', 'id5': 'category', 'id6': 'category'})
df2 = pd.read_csv("df2.csv", dtype={'id4': 'category', 'id5': 'category'})

print("df1 = \n", df1)
print("df2 = \n", df2)
df3 = df1.merge(df2, on='id2')
print(df3.shape, flush=True)
chk = [df3['v1'].sum(), df3['v2'].sum()]
print("chk = ", chk)
pd.set_option("display.max_rows", 1000000, "display.max_columns", 1000000)
print(df3, flush=True)

Describe the problem

This is a simplified and stripped down code from join test in h2o benchmark. For me it produces different output for pandas and modin, rows are the same but order is different (note rows 3-6):

Pandas
   id1_x  id2     id3 id4_x  id5_x       id6         v1  id1_y id4_y  id5_y  \
0      1  767  145412   id1  id767  id145412  28.542724      1   id1  id767
1      1  767  426077   id1  id767  id426077  74.374688      1   id1  id767
2      1  812  477505   id1  id812  id477505  15.382738      1   id1  id812
3      1  812  703719   id1  id812  id703719  46.007404      1   id1  id812
4      1  287  270849   id1  id287  id270849  34.673300      1   id1  id287
5      1  906  588071   id1  id906  id588071  65.627694      1   id1  id906
6      1  483  699261   id1  id483  id699261  69.296542      1   id1  id483
7      1  630  956216   id1  id630  id956216  22.269699      1   id1  id630
8      1  686  389682   id1  id686  id389682  50.464490      1   id1  id686

Modin
   id1_x  id2     id3 id4_x  id5_x       id6         v1  id1_y id4_y  id5_y  \
0      1  767  145412   id1  id767  id145412  28.542724      1   id1  id767
1      1  767  426077   id1  id767  id426077  74.374688      1   id1  id767
2      1  812  477505   id1  id812  id477505  15.382738      1   id1  id812
3      1  287  270849   id1  id287  id270849  34.673300      1   id1  id287
4      1  906  588071   id1  id906  id588071  65.627694      1   id1  id906
5      1  483  699261   id1  id483  id699261  69.296542      1   id1  id483
6      1  812  703719   id1  id812  id703719  46.007404      1   id1  id812
7      1  630  956216   id1  id630  id956216  22.269699      1   id1  id630
8      1  686  389682   id1  id686  id389682  50.464490      1   id1  id686

Source code / logs

Datafiles data.zip

YarShev commented 3 years ago

After some investigation it turned out pandas does merge as follows:

Takes key from the join column of the left DataFrame.
Searches for all entries of the key in the entire column of the left DataFrame.
Does merge for the entries of the key which match to keys from the join column of the right DataFrame.

We do merge as follows:

Divide the left DataFrame into row partitions.
Broadcast entire the right DataFrame on every row partition of the left DataFrame and perform merge between them.

Since we have the join column of the left DataFrame which is divided into several parts, then we can't ensure that all entries of the key will be find in the join column. I suggest adding a note regarding this in our documentation but keeping the issue is opened in order to fix it in the future if it is possible. @devin-petersohn , @gshimansky , what do you think?

gshimansky commented 3 years ago

According to merge spec in inner mode order should be the same as in left frame.

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

Do you think that we break this requirement when order is inner?

gshimansky commented 3 years ago

Actually left and right also have key order requirements

left: use only keys from left frame, similar to a SQL left outer join; preserve key order. right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

YarShev commented 3 years ago

I don't think we break the requirements on order inner and left because there is parallel implementation only for these types of join. Both Modin's and pandas's orders are similar to SQL join but as I said pandas does as follows:

Takes key from the join column of the left DataFrame.
Searches for all entries of the key in the entire column of the left DataFrame.
Does merge for the entries of the key which match to keys from the join column of the right DataFrame.

May be it is better to see an example.

import modin.pandas as pd
pdf1 = pandas.read_csv("df1.csv", dtype={'id4': 'category', 'id5': 'category', 'id6': 'category'})
pdf2 = pandas.read_csv("df2.csv", dtype={'id4': 'category', 'id5': 'category'})
pdf1
    id1  id2     id3  id4    id5       id6         v1
0     1  891   60496  id1  id891   id60496  81.391571
1     1  891  787673  id1  id380  id787673  30.193613
2     1  752  325820  id1  id807  id325820  68.204899
3     1  752  196487  id1  id410  id196487  17.185598
4     1  891  123566  id1  id903  id123566  16.056599
5     1  752  644001  id1  id693  id644001  42.237435
6     1  891  175906  id1  id607  id175906  16.652637
7     1  752  470410  id1  id519  id470410  16.660687
8     1  752  145412  id1  id767  id145412  28.542724
9     1  750  665777  id1   id22  id665777  40.815938
10    1  891  915093  id1  id440  id915093  91.560113
11    1  750  145412  id1  id440  id665777  40.815938
12    1  750  145412  id1  id440  id665777  40.815938
13    1  752  145412  id1  id440  id665777  40.815938
14    1  891  145412  id2  id441  id665778  41.815938
15    1  752  145412  id3  id442  id665779  42.815938
16    1   33  145412  id4  id443  id665780  43.815938
####################################
pdf2
    id1   id2  id4     id5         v2
0     1    33  id1    id34  58.434734
1     1   891  id1    id10  16.705822
2     1   821  id1   id821  46.066396
3     1   243  id1   id243  60.372908
4     1   750  id1   id824  53.858642
5     1   105  id1   id105  72.612812
6     1   421  id1   id421  81.306241
7     1   752  id1   id264  68.709592
8     2  1041  id2  id1041  74.682086
9     1   127  id1   id127  62.728320
10    1   187  id1   id187  53.797878
11    1   360  id1   id360  34.490221
12    1   917  id1   id917   6.746300
####################################
pdf3 = pdf1.merge(pdf2, on='id2')
    id1_x  id2     id3 id4_x  id5_x       id6         v1  id1_y id4_y  id5_y         v2
0       1  891   60496   id1  id891   id60496  81.391571      1   id1   id10  16.705822
1       1  891  787673   id1  id380  id787673  30.193613      1   id1   id10  16.705822
2       1  891  123566   id1  id903  id123566  16.056599      1   id1   id10  16.705822
3       1  891  175906   id1  id607  id175906  16.652637      1   id1   id10  16.705822
4       1  891  915093   id1  id440  id915093  91.560113      1   id1   id10  16.705822
5       1  891  145412   id2  id441  id665778  41.815938      1   id1   id10  16.705822
6       1  752  325820   id1  id807  id325820  68.204899      1   id1  id264  68.709592
7       1  752  196487   id1  id410  id196487  17.185598      1   id1  id264  68.709592
8       1  752  644001   id1  id693  id644001  42.237435      1   id1  id264  68.709592
9       1  752  470410   id1  id519  id470410  16.660687      1   id1  id264  68.709592
10      1  752  145412   id1  id767  id145412  28.542724      1   id1  id264  68.709592
11      1  752  145412   id1  id440  id665777  40.815938      1   id1  id264  68.709592
12      1  752  145412   id3  id442  id665779  42.815938      1   id1  id264  68.709592
13      1  750  665777   id1   id22  id665777  40.815938      1   id1  id824  53.858642
14      1  750  145412   id1  id440  id665777  40.815938      1   id1  id824  53.858642
15      1  750  145412   id1  id440  id665777  40.815938      1   id1  id824  53.858642
16      1   33  145412   id4  id443  id665780  43.815938      1   id1   id34  58.434734

We can see pandas sorts on equal entries of keys. We do such thing partially as we divide DataFrame into row partitions. I think it is not critical.

pyrito commented 2 years ago

@gshimansky I tried reproducing the bug on the latest master with the following:

In [4]: import os
   ...: import gc
   ...: import timeit
   ...: import modin.pandas as pd
   ...: import pandas
   ...: from modin.pandas.test.utils import df_equals
   ...: 
   ...: df1 = pd.read_csv("df1.csv", dtype={'id4': 'category', 'id5': 'category', 'id6': 'category'})
   ...: df2 = pd.read_csv("df2.csv", dtype={'id4': 'category', 'id5': 'category'})
   ...: pdf1 = df1._to_pandas()
   ...: pdf2 = df2._to_pandas()
   ...: 
   ...: df_equals(df1, pdf1)
   ...: df_equals(df2, pdf2)
   ...: 
   ...: print("df1 = \n", df1)
   ...: print("df2 = \n", df2)
   ...: df3 = df1.merge(df2, on='id2')
   ...: pdf3 = pdf1.merge(pdf2, on='id2')
   ...: print(df3.shape, flush=True)
   ...: 
   ...: df_equals(df3, pdf3)
   ...: 
   ...: chk = [df3['v1'].sum(), df3['v2'].sum()]
   ...: pchk = [pdf3['v1'].sum(), pdf3['v2'].sum()]
   ...: assert chk == pchk
   ...: print("chk = ", chk)
df1 = 
     id1  id2     id3  id4    id5       id6         v1
0     1  891   60496  id1  id891   id60496  81.391571
1     1  380  787673  id1  id380  id787673  30.193613
2     1  807  325820  id1  id807  id325820  68.204899
3     1  410  196487  id1  id410  id196487  17.185598
4     1  903  123566  id1  id903  id123566  16.056599
..  ...  ...     ...  ...    ...       ...        ...
75    1  383  762796  id1  id383  id762796  38.322020
76    1   60   72812  id1   id60   id72812  81.327288
77    1   72  146864  id1   id72  id146864  54.909071
78    1  358   60397  id1  id358   id60397  37.021526
79    1  186  570607  id1  id186  id570607  25.336286

[80 rows x 7 columns]
df2 = 
     id1   id2  id4     id5         v2
0     1    34  id1    id34  58.434734
1     1    10  id1    id10  16.705822
2     1   821  id1   id821  46.066396
3     1   243  id1   id243  60.372908
4     1   824  id1   id824  53.858642
..  ...   ...  ...     ...        ...
75    2  1259  id2  id1259  37.180976
76    1   595  id1   id595  84.506771
77    1   510  id1   id510   1.905531
78    1   287  id1   id287  61.726441
79    1   600  id1   id600  18.217756

[80 rows x 5 columns]
(9, 11)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 22>()
     19 pdf3 = pdf1.merge(pdf2, on='id2')
     20 print(df3.shape, flush=True)
---> 22 df_equals(df3, pdf3)
     24 chk = [df3['v1'].sum(), df3['v2'].sum()]
     25 pchk = [pdf3['v1'].sum(), pdf3['v2'].sum()]

File ~/Documents/modin/modin/pandas/test/utils.py:589, in df_equals(df1, df2)
    586     assert_empty_frame_equal(df1, df2)
    588 if isinstance(df1, pandas.DataFrame) and isinstance(df2, pandas.DataFrame):
--> 589     assert_frame_equal(
    590         df1,
    591         df2,
    592         check_dtype=False,
    593         check_datetimelike_compat=True,
    594         check_index_type=False,
    595         check_column_type=False,
    596         check_categorical=False,
    597     )
    598     df_categories_equals(df1, df2)
    599 elif isinstance(df1, pandas.Index) and isinstance(df2, pandas.Index):

    [... skipping hidden 2 frame]

File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
    679 if diff is not None:
    680     msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)

AssertionError: DataFrame.iloc[:, 1] (column name="id2") are different

DataFrame.iloc[:, 1] (column name="id2") values are different (66.66667 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8]
[left]:  [767, 812, 287, 767, 906, 483, 812, 630, 686]
[right]: [767, 767, 812, 812, 287, 906, 483, 630, 686]

modin-project / modin