dsaxton opened 4 years ago
Yea I would agree that we should not join on NA
There is an existing issue about the fact that right now we see NaNs as matching, which can blow up your merge if you have NaNs in both the left and right key columns (since you get a cross product of those rows). I need to look up the issue, but I think there is some agreement that we don't like that behaviour, so that would be a nice thing to fix with NAs.
The issue is https://github.com/pandas-dev/pandas/issues/22491, but we apparently have many other related, possibly duplicate, issues (from a quick search): https://github.com/pandas-dev/pandas/issues/28522, https://github.com/pandas-dev/pandas/issues/22618, https://github.com/pandas-dev/pandas/issues/25138, https://github.com/pandas-dev/pandas/issues/7473 (we should probably clean that up a bit)
But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)
cc @jorisvandenbossche @WillAyd
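The cross-product blow-up described above can be reproduced in a few lines. This is a minimal sketch of the current behaviour the thread wants to change; the row count assumes NaN keys still match each other, as described above:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": [1.0, np.nan, np.nan], "l": [1, 2, 3]})
right = pd.DataFrame({"key": [1.0, np.nan, np.nan], "r": [4, 5, 6]})

out = pd.merge(left, right, on="key")
# 1.0 matches once, but each NaN on the left matches each NaN on the
# right, giving a 2 x 2 cross product: 1 + 4 = 5 rows in total.
print(len(out))  # 5
```

With many NaNs on both sides this grows quadratically, which is why the thread calls it a "blow up".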
So the behavior should be that NaN does not match NaN, for all NaN types? This would break a few tests. Can we do that from the perspective of backwards compatibility, or should we warn first before fixing this?
For backwards incompatible changes like this we would just go through a deprecation cycle first
Ok, thanks very much, will have a look in the coming days
I just ran into this problem. This behaviour is unlike any SQL join behaviour I have encountered so far, and completely unexpected, since usually `NaN == NaN` evaluates to `False`. I only caught this bug in my code by sheer luck.
What is the status of this?
In my opinion, this should be deprecated as fast as possible. Meanwhile, I think a big disclaimer should be put into the merge documentation that this is the current behaviour, as a warning to everyone using this.
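For reference, NaN comparing unequal to itself is standard IEEE 754 float behaviour, in plain Python and NumPy alike:

```python
import numpy as np

x = float("nan")
print(x == x)            # False -- NaN never equals itself
print(np.nan == np.nan)  # False
```

This is the expectation the current merge behaviour violates.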
This is highly non-trivial, since we have a lot of functions matching NaN with NaN. If we want to do this, we first have to discuss the scope of the deprecation.
@phofl happy to discuss this.
What do you mean by "a lot of functions matching NaN with NaN"? I can think of `df.compare`, but there it makes sense to say `np.nan == np.nan`. Are there other functions we need to consider?
In my opinion, our target should be to have the "correct" behaviour for all nullable dtypes, even though this means going through a deprecation cycle.
> But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)
@jorisvandenbossche For the sake of consistency, I would propose making sure that the behaviour for the new nullable dtypes does not differ from the old, even if that means the behaviour is initially "wrong" for the new types. Otherwise, this might cause even more confusion.
Some examples:
- union
- drop_duplicates
- unique
- intersection
- and so on
I think all of these are fine as-is, because (at least in my opinion) they behave just as someone would expect if they were familiar with Python, but not pandas.
In contrast, the behaviour of `df.merge` is the opposite of what one would usually expect. Looking at this thread and the similar ones that @mroeschke closed, I feel like there is universal agreement that the default should be to not match NaN with NaN in the case of a merge, or at least to display a warning.
Solution proposal
A solution proposal would be to add an additional parameter to `merge`: `match_na`. This is how R's dplyr handles it. This way, the user can explicitly state what they want. To ensure backwards compatibility, in a first step we could leave the default at `True`, but display a warning if the parameter is not explicitly set and NA values match. The results would be:
- `match_na=True`: NA values match each other (the current behaviour).
- `match_na=False`: NA values never match (SQL-like behaviour).

At a later point, we could discuss changing the default behaviour - however, I think as long as a warning is displayed and there is an easy option to switch behaviour, this might not even be necessary.
Note: As a comparison, based on this thread, Matlab seems to behave like SQL for joins.
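Until something like `match_na` exists, the `False` case can be emulated for inner joins by dropping NA-keyed rows before merging. A sketch; the helper name `merge_no_na_match` is made up for illustration, and this only covers inner joins (for outer joins the dropped rows would have to be re-appended):

```python
import numpy as np
import pandas as pd

def merge_no_na_match(left, right, on, **kwargs):
    # Hypothetical helper, not part of pandas. Emulates the proposed
    # match_na=False for inner joins: rows whose key is missing are
    # removed up front, so an NA key can never match anything.
    return pd.merge(
        left.dropna(subset=[on]),
        right.dropna(subset=[on]),
        on=on,
        **kwargs,
    )

left = pd.DataFrame({"k": ["a", np.nan], "l": [1, 2]})
right = pd.DataFrame({"k": ["a", np.nan], "r": [3, 4]})

print(len(pd.merge(left, right, on="k")))        # 2 -- NaN matched NaN
print(len(merge_no_na_match(left, right, "k")))  # 1 -- SQL-like result
```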
I am not opposed to changing it. I looked into this a few months back but got stuck on all the cases to consider, because merge uses some of the functions mentioned above quite heavily. If we diverge there, this gets pretty difficult real quick.
So the discussion about this may be easy, but the implementation is not as straightforward without adding a lot of complexity. If you would like to try this, I think it would certainly be welcome.
Has there been any additional discussion on this? I came across this behavior in my work, and I was genuinely surprised that this was the default behavior of Pandas.
The `match_na` solution for `merge` with a deprecation warning that @kasuteru proposed is a great idea.
In the interim, a warning should be put into the merge documentation. This should be a high priority...
Don't really have any suggestions here, but just wanted to expose this: currently `pd.NA` and `np.nan` are treated as the same key when merging, even though they are treated as different elsewhere (e.g. `drop_duplicates()`):
```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": ["a", "b", pd.NA]}, dtype=object)
right = pd.DataFrame({"b": ["c", np.nan, pd.NA]}, dtype=object)

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#       a     b
# 0  <NA>   NaN
# 1  <NA>  <NA>
```
The proposed `match_na` solution above would have to consider this case.
Just checking, what's the latest update on this? Was it addressed?
The issue is still open
Not sure if this belongs here, but under certain conditions, `NaN` also matches on non-`NaN` values. Here is a small example (pandas 2.1.4):
```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    "idx_a": ["A", "A", "B"],
    "idx_b": ["alpha", "beta", "gamma"],
    "idx_c": [1.0, 1.0, 1.0],
    "x": [10, 20, 30],
}).set_index(["idx_a", "idx_b", "idx_c"])

b = pd.DataFrame({
    "idx_b": ["gamma", "delta", "epsilon", np.nan, np.nan],
    "idx_c": [1.0, 1.0, 1.0, 1.0, 1.0],
    "y": [100, 200, 300, 400, 500],
}).set_index(["idx_b", "idx_c"])

c = a.join(b, how="inner", on=["idx_b", "idx_c"])
```
```
>>> print(a)
                    x
idx_a idx_b idx_c
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

>>> print(b)
                 y
idx_b   idx_c
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

>>> print(c)
                    x    y
idx_a idx_b idx_c
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500
```
Mentioning here because other issues describing this behavior have been merged here (e.g., #25138), but all of the discussion above seems to refer only to `NaN` matching on `NaN`.
(Above is from 1.0.1 and master)
I think when joining on a nullable column we should not be matching `NA` with `NA`, and should only be joining where we have unambiguous equality (as in SQL). Also worth noting that this is the same as what happens when we have `NaN`, which also seems incorrect, so this could be an opportunity to fix that behavior as well?
cc @jorisvandenbossche
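For comparison, the SQL semantics referred to here can be checked with `sqlite3` from the Python standard library: `NULL = NULL` evaluates to NULL (not true), so NULL keys never join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE l (k)")
con.execute("CREATE TABLE r (k)")
con.executemany("INSERT INTO l VALUES (?)", [(1,), (None,)])
con.executemany("INSERT INTO r VALUES (?)", [(1,), (None,)])

# The NULL rows on both sides never match anything.
rows = con.execute("SELECT * FROM l JOIN r ON l.k = r.k").fetchall()
print(rows)  # [(1, 1)]
```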