pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.82k stars 17.99k forks source link

BUG? merging on column of empty frame with index of right frame #15692

Open jorisvandenbossche opened 7 years ago

jorisvandenbossche commented 7 years ago

It is a rather specific corner case, but there has been a change in behaviour when merging an empty frame:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: left = pd.DataFrame(columns=['key', 'col_left'])

In [3]: left
Out[3]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [4]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [5]: right
Out[5]: 
  col_right
0         a
1         b
2         c

In [6]: left.merge(right, left_on='key', right_index=True, how="right")
Out[6]: 
   key col_left col_right
0    0      NaN         a
1    1      NaN         b
2    2      NaN         c

vs

In [10]: pd.__version__
Out[10]: u'0.18.1'

In [11]: left = pd.DataFrame(columns=['key', 'col_left'])

In [12]: left
Out[12]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [13]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [14]: right
Out[14]: 
  col_right
0         a
1         b
2         c

In [15]: left.merge(right, left_on='key', right_index=True, how="right")
Out[15]: 
   key col_left col_right
0  NaN      NaN         a
1  NaN      NaN         b
2  NaN      NaN         c

So with 0.19 the 'key' column has values, in 0.18 this holds NaNs. The key column comes from the empty frame (so it had no values, how can it have values now?), but is merged with the index of the left frame (and this has of course values -> should these end up in the 'key' column of the resulting frame?) It is such a strange case, that I am actually not sure which of both is the expected behaviour .. (and also not sure if this was an intentional change in behaviour).

Encountered here: https://github.com/geopandas/geopandas/pull/422

jreback commented 7 years ago

changed here: https://github.com/pandas-dev/pandas/commit/e8d9e79fc7d0a31e8c37c82f1e48d51cce59e9e0

this tries to coerce keys back to the original dtype, might be buggy

jorisvandenbossche commented 7 years ago

So the question is, when not merging on a common column (in this case left:column and right:index), do we want both in the output, and should both be unique? (or only unique in the overlapping values?)

Some further exploration (with the same example as above):

Settting the right index to other values makes it clear the index and key column duplicates the 'merge' values:

In [7]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=[2,3,4])

In [8]: left.merge(right, left_on='key', right_index=True, how="right")
Out[8]: 
   key col_left col_right
2    2      NaN         a
3    3      NaN         b
4    4      NaN         c

When the left frame is not fully empty, but just missing values in comparison to right frame, you also get strange behaviour:

In [10]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [11]: left
Out[11]: 
  col_left  key
0        e    3

In [12]: left.merge(right, left_on='key', right_index=True, how="right")
Out[12]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

The 'key' column is also filled like before, but now the index is gone (and not even default values, but all 0's)

jorisvandenbossche commented 7 years ago

Behaviour in 0.18 of the above:

In [18]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [19]: left.merge(right, left_on='key', right_index=True, how="right")
Out[19]: 
  col_left  key col_right
0        e  3.0         b
0      NaN  2.0         a
0      NaN  4.0         c

So the bug with the 0's in the index is the same, and also here the key column gets filled in (only not retained dtype, that is what was fixed in 0.19). So probably the filling of the key column is then the desired behaviour ?

jorisvandenbossche commented 7 years ago

The behaviour of the resulting index is rather buggy:

In [15]: left
Out[15]: 
  col_left  key
0        e    3

## All zero's

In [16]: left.merge(right, left_on='key', right_index=True, how="right")
Out[16]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

In [17]: left = pd.DataFrame({'key':[2,3,4], 'col_left':['e', 'd', 'g']})

## Incorrect "default" index (or not sure where that values are coming from)

In [18]: left.merge(right, left_on='key', right_index=True, how="right")
Out[18]: 
  col_left  key col_right
0        e    2         a
1        d    3         b
2        g    4         c

## This seems correct?

In [19]: left = pd.DataFrame({'key':[0,1,2,3,4,5], 'col_left':['e', 'd', 'g', 'h
    ...: ', 'b', 'p']})

In [20]: left.merge(right, left_on='key', right_index=True, how="right")
Out[20]: 
  col_left  key col_right
2        g    2         a
3        h    3         b
4        b    4         c
jreback commented 7 years ago

yeah I would say the index of the result is wrong. It should be just a range index. Maybe not getting setup somehow.

In [10]: left
Out[10]: 
  col_left  key
0        e    3

In [11]: right
Out[11]: 
  col_right
2         a
3         b
4         c

In [12]: left.merge(right, left_on='key', right_index=True, how="right", indicator=True)
Out[12]: 
  col_left  key col_right      _merge
0        e    3         b        both
0      NaN    2         a  right_only
0      NaN    4         c  right_only
jorisvandenbossche commented 7 years ago

It should be just a range index.

It is a right join on the index (at least on the right index), so an option is also to preserve the right index. It makes that you have the merge values duplicate (in column and in index), but the same is true when merging on two different named columns.

BTW, when merging on two different columns, the key columns also don't get filled in on non-matching rows:

In [33]: left = pd.DataFrame({'key_left': [3], 'col_left': ['e']})

In [34]: right = pd.DataFrame({'col_right': ['a', 'b', 'c'], 'key_right':[2,3,4]
    ...: })

In [35]: left.merge(right, left_on='key_left', right_on='key_right', how='right'
    ...: )
Out[35]: 
  col_left  key_left col_right  key_right
0        e       3.0         b          3
1      NaN       NaN         a          2
2      NaN       NaN         c          4

So this is a further inconsistency with the left_on='key', right_index=True case.

randomgambit commented 7 years ago

seems to work correctly when using dates for the keys ... except for the weird index.


left = pd.DataFrame({'key':[pd.to_datetime('2016-01-02')], 'col_left':['e']})
left
Out[78]: 
  col_left        key
0        e 2016-01-02

right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=pd.date_range('2016-01-01', periods =3, freq = 'D'))
right
Out[79]: 
           col_right
2016-01-01         a
2016-01-02         b
2016-01-03         c

left.merge(right, how = 'right', left_on = 'key', right_index = True)
Out[80]: 
  col_left        key col_right
0        e 2016-01-02         b
0      NaN 2016-01-01         a
0      NaN 2016-01-03         c

#now works correctly with reset_index
left.merge(right.reset_index(), how = 'right', left_on = 'key', right_on = 'index')
Out[81]: 
  col_left        key      index col_right
0        e 2016-01-02 2016-01-02         b
1      NaN        NaT 2016-01-01         a
2      NaN        NaT 2016-01-03         c
jorisvandenbossche commented 7 years ago

@randomgambit sorry, I don't see how this example is different from the example without dates. The buggy index (all 0's) is exactly one of the issues.

randomgambit commented 7 years ago

hello cher monsieur @jorisvandenbossche !

yes you are actually right. I misread my console. Sorry about that. Well, I guess my point is that this bug carries over also for datetime variables :) I also noticed this pb a couple of times, and I always do my merges after having reset the index anyways

jorisvandenbossche commented 7 years ago

No problem! Yes, indeed, the inconsistencies / special cases are mainly caused by mixing merging on column and index.