Open gshimansky opened 3 years ago
After some investigation it turned out pandas does merge
as follows:
DataFrame
.DataFrame
.merge
for the entries of the key which match to keys from the join column of the right DataFrame
.We do merge
as follows:
DataFrame
into row partitions.DataFrame
on every row partition of the left DataFrame
and perform merge
between them.Since we have the join column of the left DataFrame
which is divided into several parts, then we can't ensure that all entries of the key will be find in the join column.
I suggest adding a note regarding this in our documentation but keeping the issue is opened in order to fix it in the future if it is possible. @devin-petersohn , @gshimansky , what do you think?
According to merge spec in inner
mode order should be the same as in left frame.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
Do you think that we break this requirement when order is inner
?
Actually left
and right
also have key order requirements
left: use only keys from left frame, similar to a SQL left outer join; preserve key order. right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
I don't think we break the requirements on order inner
and left
because there is parallel implementation only for these types of join. Both Modin's and pandas's orders are similar to SQL join but as I said pandas does as follows:
May be it is better to see an example.
import modin.pandas as pd
pdf1 = pandas.read_csv("df1.csv", dtype={'id4': 'category', 'id5': 'category', 'id6': 'category'})
pdf2 = pandas.read_csv("df2.csv", dtype={'id4': 'category', 'id5': 'category'})
pdf1
id1 id2 id3 id4 id5 id6 v1
0 1 891 60496 id1 id891 id60496 81.391571
1 1 891 787673 id1 id380 id787673 30.193613
2 1 752 325820 id1 id807 id325820 68.204899
3 1 752 196487 id1 id410 id196487 17.185598
4 1 891 123566 id1 id903 id123566 16.056599
5 1 752 644001 id1 id693 id644001 42.237435
6 1 891 175906 id1 id607 id175906 16.652637
7 1 752 470410 id1 id519 id470410 16.660687
8 1 752 145412 id1 id767 id145412 28.542724
9 1 750 665777 id1 id22 id665777 40.815938
10 1 891 915093 id1 id440 id915093 91.560113
11 1 750 145412 id1 id440 id665777 40.815938
12 1 750 145412 id1 id440 id665777 40.815938
13 1 752 145412 id1 id440 id665777 40.815938
14 1 891 145412 id2 id441 id665778 41.815938
15 1 752 145412 id3 id442 id665779 42.815938
16 1 33 145412 id4 id443 id665780 43.815938
####################################
pdf2
id1 id2 id4 id5 v2
0 1 33 id1 id34 58.434734
1 1 891 id1 id10 16.705822
2 1 821 id1 id821 46.066396
3 1 243 id1 id243 60.372908
4 1 750 id1 id824 53.858642
5 1 105 id1 id105 72.612812
6 1 421 id1 id421 81.306241
7 1 752 id1 id264 68.709592
8 2 1041 id2 id1041 74.682086
9 1 127 id1 id127 62.728320
10 1 187 id1 id187 53.797878
11 1 360 id1 id360 34.490221
12 1 917 id1 id917 6.746300
####################################
pdf3 = pdf1.merge(pdf2, on='id2')
id1_x id2 id3 id4_x id5_x id6 v1 id1_y id4_y id5_y v2
0 1 891 60496 id1 id891 id60496 81.391571 1 id1 id10 16.705822
1 1 891 787673 id1 id380 id787673 30.193613 1 id1 id10 16.705822
2 1 891 123566 id1 id903 id123566 16.056599 1 id1 id10 16.705822
3 1 891 175906 id1 id607 id175906 16.652637 1 id1 id10 16.705822
4 1 891 915093 id1 id440 id915093 91.560113 1 id1 id10 16.705822
5 1 891 145412 id2 id441 id665778 41.815938 1 id1 id10 16.705822
6 1 752 325820 id1 id807 id325820 68.204899 1 id1 id264 68.709592
7 1 752 196487 id1 id410 id196487 17.185598 1 id1 id264 68.709592
8 1 752 644001 id1 id693 id644001 42.237435 1 id1 id264 68.709592
9 1 752 470410 id1 id519 id470410 16.660687 1 id1 id264 68.709592
10 1 752 145412 id1 id767 id145412 28.542724 1 id1 id264 68.709592
11 1 752 145412 id1 id440 id665777 40.815938 1 id1 id264 68.709592
12 1 752 145412 id3 id442 id665779 42.815938 1 id1 id264 68.709592
13 1 750 665777 id1 id22 id665777 40.815938 1 id1 id824 53.858642
14 1 750 145412 id1 id440 id665777 40.815938 1 id1 id824 53.858642
15 1 750 145412 id1 id440 id665777 40.815938 1 id1 id824 53.858642
16 1 33 145412 id4 id443 id665780 43.815938 1 id1 id34 58.434734
We can see pandas sorts on equal entries of keys. We do such thing partially as we divide DataFrame
into row partitions. I think it is not critical.
@gshimansky I tried reproducing the bug on the latest master with the following:
In [4]: import os
...: import gc
...: import timeit
...: import modin.pandas as pd
...: import pandas
...: from modin.pandas.test.utils import df_equals
...:
...: df1 = pd.read_csv("df1.csv", dtype={'id4': 'category', 'id5': 'category', 'id6': 'category'})
...: df2 = pd.read_csv("df2.csv", dtype={'id4': 'category', 'id5': 'category'})
...: pdf1 = df1._to_pandas()
...: pdf2 = df2._to_pandas()
...:
...: df_equals(df1, pdf1)
...: df_equals(df2, pdf2)
...:
...: print("df1 = \n", df1)
...: print("df2 = \n", df2)
...: df3 = df1.merge(df2, on='id2')
...: pdf3 = pdf1.merge(pdf2, on='id2')
...: print(df3.shape, flush=True)
...:
...: df_equals(df3, pdf3)
...:
...: chk = [df3['v1'].sum(), df3['v2'].sum()]
...: pchk = [pdf3['v1'].sum(), pdf3['v2'].sum()]
...: assert chk == pchk
...: print("chk = ", chk)
df1 =
id1 id2 id3 id4 id5 id6 v1
0 1 891 60496 id1 id891 id60496 81.391571
1 1 380 787673 id1 id380 id787673 30.193613
2 1 807 325820 id1 id807 id325820 68.204899
3 1 410 196487 id1 id410 id196487 17.185598
4 1 903 123566 id1 id903 id123566 16.056599
.. ... ... ... ... ... ... ...
75 1 383 762796 id1 id383 id762796 38.322020
76 1 60 72812 id1 id60 id72812 81.327288
77 1 72 146864 id1 id72 id146864 54.909071
78 1 358 60397 id1 id358 id60397 37.021526
79 1 186 570607 id1 id186 id570607 25.336286
[80 rows x 7 columns]
df2 =
id1 id2 id4 id5 v2
0 1 34 id1 id34 58.434734
1 1 10 id1 id10 16.705822
2 1 821 id1 id821 46.066396
3 1 243 id1 id243 60.372908
4 1 824 id1 id824 53.858642
.. ... ... ... ... ...
75 2 1259 id2 id1259 37.180976
76 1 595 id1 id595 84.506771
77 1 510 id1 id510 1.905531
78 1 287 id1 id287 61.726441
79 1 600 id1 id600 18.217756
[80 rows x 5 columns]
(9, 11)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [4], in <cell line: 22>()
19 pdf3 = pdf1.merge(pdf2, on='id2')
20 print(df3.shape, flush=True)
---> 22 df_equals(df3, pdf3)
24 chk = [df3['v1'].sum(), df3['v2'].sum()]
25 pchk = [pdf3['v1'].sum(), pdf3['v2'].sum()]
File ~/Documents/modin/modin/pandas/test/utils.py:589, in df_equals(df1, df2)
586 assert_empty_frame_equal(df1, df2)
588 if isinstance(df1, pandas.DataFrame) and isinstance(df2, pandas.DataFrame):
--> 589 assert_frame_equal(
590 df1,
591 df2,
592 check_dtype=False,
593 check_datetimelike_compat=True,
594 check_index_type=False,
595 check_column_type=False,
596 check_categorical=False,
597 )
598 df_categories_equals(df1, df2)
599 elif isinstance(df1, pandas.Index) and isinstance(df2, pandas.Index):
[... skipping hidden 2 frame]
File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()
File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()
File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
679 if diff is not None:
680 msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 1] (column name="id2") are different
DataFrame.iloc[:, 1] (column name="id2") values are different (66.66667 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8]
[left]: [767, 812, 287, 767, 906, 483, 812, 630, 686]
[right]: [767, 767, 812, 812, 287, 906, 483, 630, 686]
System information
Ubuntu 19.10
modin.__version__
):0.8.1.1+4.gbd8f73a
Python 3.8.3
Describe the problem
This is a simplified and stripped down code from join test in h2o benchmark. For me it produces different output for pandas and modin, rows are the same but order is different (note rows 3-6):
Source code / logs
Datafiles data.zip