Open jorisvandenbossche opened 7 years ago
changed here: https://github.com/pandas-dev/pandas/commit/e8d9e79fc7d0a31e8c37c82f1e48d51cce59e9e0
this tries to coerce keys back to the original dtype, might be buggy
So the question is, when not merging on a common column (in this case left:column and right:index), do we want both in the output, and should both be unique? (or only unique in the overlapping values?)
Some further exploration (with the same example as above):
Settting the right index to other values makes it clear the index and key column duplicates the 'merge' values:
In [7]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=[2,3,4])
In [8]: left.merge(right, left_on='key', right_index=True, how="right")
Out[8]:
key col_left col_right
2 2 NaN a
3 3 NaN b
4 4 NaN c
When the left frame is not fully empty, but just missing values in comparison to right frame, you also get strange behaviour:
In [10]: left = pd.DataFrame({'key':[3], 'col_left':['e']})
In [11]: left
Out[11]:
col_left key
0 e 3
In [12]: left.merge(right, left_on='key', right_index=True, how="right")
Out[12]:
col_left key col_right
0 e 3 b
0 NaN 2 a
0 NaN 4 c
The 'key' column is also filled like before, but now the index is gone (and not even default values, but all 0's)
Behaviour in 0.18 of the above:
In [18]: left = pd.DataFrame({'key':[3], 'col_left':['e']})
In [19]: left.merge(right, left_on='key', right_index=True, how="right")
Out[19]:
col_left key col_right
0 e 3.0 b
0 NaN 2.0 a
0 NaN 4.0 c
So the bug with the 0's in the index is the same, and also here the key column gets filled in (only not retained dtype, that is what was fixed in 0.19). So probably the filling of the key column is then the desired behaviour ?
The behaviour of the resulting index is rather buggy:
In [15]: left
Out[15]:
col_left key
0 e 3
## All zero's
In [16]: left.merge(right, left_on='key', right_index=True, how="right")
Out[16]:
col_left key col_right
0 e 3 b
0 NaN 2 a
0 NaN 4 c
In [17]: left = pd.DataFrame({'key':[2,3,4], 'col_left':['e', 'd', 'g']})
## Incorrect "default" index (or not sure where that values are coming from)
In [18]: left.merge(right, left_on='key', right_index=True, how="right")
Out[18]:
col_left key col_right
0 e 2 a
1 d 3 b
2 g 4 c
## This seems correct?
In [19]: left = pd.DataFrame({'key':[0,1,2,3,4,5], 'col_left':['e', 'd', 'g', 'h
...: ', 'b', 'p']})
In [20]: left.merge(right, left_on='key', right_index=True, how="right")
Out[20]:
col_left key col_right
2 g 2 a
3 h 3 b
4 b 4 c
yeah I would say the index of the result is wrong. It should be just a range index. Maybe not getting setup somehow.
In [10]: left
Out[10]:
col_left key
0 e 3
In [11]: right
Out[11]:
col_right
2 a
3 b
4 c
In [12]: left.merge(right, left_on='key', right_index=True, how="right", indicator=True)
Out[12]:
col_left key col_right _merge
0 e 3 b both
0 NaN 2 a right_only
0 NaN 4 c right_only
It should be just a range index.
It is a right join on the index (at least on the right index), so an option is also to preserve the right index. It makes that you have the merge values duplicate (in column and in index), but the same is true when merging on two different named columns.
BTW, when merging on two different columns, the key columns also don't get filled in on non-matching rows:
In [33]: left = pd.DataFrame({'key_left': [3], 'col_left': ['e']})
In [34]: right = pd.DataFrame({'col_right': ['a', 'b', 'c'], 'key_right':[2,3,4]
...: })
In [35]: left.merge(right, left_on='key_left', right_on='key_right', how='right'
...: )
Out[35]:
col_left key_left col_right key_right
0 e 3.0 b 3
1 NaN NaN a 2
2 NaN NaN c 4
So this is a further inconsistency with the left_on='key', right_index=True
case.
seems to work correctly when using dates for the keys ... except for the weird index.
left = pd.DataFrame({'key':[pd.to_datetime('2016-01-02')], 'col_left':['e']})
left
Out[78]:
col_left key
0 e 2016-01-02
right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=pd.date_range('2016-01-01', periods =3, freq = 'D'))
right
Out[79]:
col_right
2016-01-01 a
2016-01-02 b
2016-01-03 c
left.merge(right, how = 'right', left_on = 'key', right_index = True)
Out[80]:
col_left key col_right
0 e 2016-01-02 b
0 NaN 2016-01-01 a
0 NaN 2016-01-03 c
#now works correctly with reset_index
left.merge(right.reset_index(), how = 'right', left_on = 'key', right_on = 'index')
Out[81]:
col_left key index col_right
0 e 2016-01-02 2016-01-02 b
1 NaN NaT 2016-01-01 a
2 NaN NaT 2016-01-03 c
@randomgambit sorry, I don't see how this example is different from the example without dates. The buggy index (all 0's) is exactly one of the issues.
hello cher monsieur @jorisvandenbossche !
yes you are actually right. I misread my console. Sorry about that. Well, I guess my point is that this bug carries over also for datetime variables :) I also noticed this pb a couple of times, and I always do my merges after having reset the index anyways
No problem! Yes, indeed, the inconsistencies / special cases are mainly caused by mixing merging on column and index.
It is a rather specific corner case, but there has been a change in behaviour when merging an empty frame:
vs
So with 0.19 the
'key'
column has values, in 0.18 this holds NaNs. The key column comes from the empty frame (so it had no values, how can it have values now?), but is merged with the index of the left frame (and this has of course values -> should these end up in the 'key' column of the resulting frame?) It is such a strange case, that I am actually not sure which of both is the expected behaviour .. (and also not sure if this was an intentional change in behaviour).Encountered here: https://github.com/geopandas/geopandas/pull/422