pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Joining on a nullable column #32306

Open dsaxton opened 4 years ago

dsaxton commented 4 years ago
import pandas as pd

left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#       a     b
# 0     1     1
# 1     2     2
# 2  <NA>  <NA>

(Above is from 1.0.1 and master)

I think when joining on a nullable column we should not match NA with NA; we should only join where there is unambiguous equality (as in SQL). It is also worth noting that this is the same as what happens with NaN, which likewise seems incorrect, so this could be an opportunity to fix that behavior too:

pd.merge(left.astype(float), right.astype(float), how="inner", left_on="a", right_on="b")
#      a    b
# 0  1.0  1.0
# 1  2.0  2.0
# 2  NaN  NaN

Expected Output

      a     b
0     1     1
1     2     2
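Until the behaviour changes, a SQL-style result can be obtained by dropping NA-keyed rows before merging; a minimal sketch of the workaround (not a pandas feature, just a pre-filter):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

# Drop NA keys first so <NA> rows cannot pair up in the inner join
result = pd.merge(
    left.dropna(subset=["a"]),
    right.dropna(subset=["b"]),
    how="inner",
    left_on="a",
    right_on="b",
)
# result contains only the unambiguous matches (1, 1) and (2, 2)
```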

cc @jorisvandenbossche

WillAyd commented 4 years ago

Yea I would agree that we should not join on NA

jorisvandenbossche commented 4 years ago

There is an existing issue about the fact that right now we see NaNs as matching, which can blow up your merge if you have NaNs in both the left and right key columns (since you get a cross-product combination of rows). I need to look up the issue, but I think there is some agreement that we don't like that behaviour, so that would be a nice thing to fix with NAs.

jorisvandenbossche commented 4 years ago

The issue is https://github.com/pandas-dev/pandas/issues/22491, but we apparently have many other related, possibly duplicate, issues (from a quick search): https://github.com/pandas-dev/pandas/issues/28522, https://github.com/pandas-dev/pandas/issues/22618, https://github.com/pandas-dev/pandas/issues/25138, https://github.com/pandas-dev/pandas/issues/7473 (we should probably clean that up a bit)

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

phofl commented 3 years ago

cc @jorisvandenbossche cc @WillAyd

To confirm: the desired behavior is that NaN does not match NaN, for all NA-like types? This would break a few tests. Can we do that from a backwards-compatibility perspective, or should we warn first before fixing this?

WillAyd commented 3 years ago

For backwards incompatible changes like this we would just go through a deprecation cycle first

phofl commented 3 years ago

Ok, thanks very much, will have a look in the coming days

kasuteru commented 2 years ago

I just ran into this problem. This behaviour is unlike any SQL join behaviour I have encountered so far, and completely unexpected, since usually NaN == NaN evaluates to False. I only caught this bug in my code by sheer luck.
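For reference, scalar comparisons behave exactly as described here, which is why the merge behaviour is so surprising; a quick check:

```python
import numpy as np
import pandas as pd

# IEEE 754: NaN compares unequal to everything, including itself
print(float("nan") == float("nan"))  # False
print(np.nan == np.nan)              # False

# pd.NA propagates through comparisons instead of evaluating to a bool
print(pd.NA == pd.NA)                # <NA>
```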

What is the status of this?

In my opinion, this should be deprecated as fast as possible. Meanwhile, I think a big disclaimer should be put into the merge documentation that this is the current behaviour, as a warning to everyone using this.

phofl commented 2 years ago

This is highly non-trivial since we have a lot of functions matching NaN with NaN. If we want to do this, we have to discuss the scope of the deprecation first.

kasuteru commented 2 years ago

@phofl happy to discuss this.

What do you mean by "a lot of functions matching NaN with NaN"? I can think of df.compare, but there it makes sense to treat np.nan == np.nan as equal. Are there other functions we need to consider?
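To make the df.compare point concrete: aligned NaNs are currently not reported as differences, which is the behaviour most users want from that method; a quick illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, np.nan]})

# NaN in the same position on both sides is treated as equal,
# so compare() reports no differences at all
diff = df1.compare(df2)
print(diff.empty)  # True
```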

In my opinion, our target should be to have the "correct" behaviour for all nullable dtypes, even though this means going through a deprecation cycle.

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

@jorisvandenbossche For the sake of consistency, I would propose making sure that behaviour for the new nullable dtypes does not differ from the old, even if that means that the behaviour is initially "wrong" for the new types. Otherwise, this might cause even more confusion.

phofl commented 2 years ago

Some examples:

  • union
  • drop_duplicates
  • unique
  • intersection
  • and so on

kasuteru commented 2 years ago

Some examples:

  • union
  • drop_duplicates
  • unique
  • intersection
  • and so on

I think all of these are fine as-is, because (at least in my opinion) they behave just as someone would expect if they were familiar with Python, but not pandas.
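Illustrating that point: the set-like operations listed above already collapse repeated NaNs into a single value, which is what most users expect from them:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])

# drop_duplicates keeps only a single NaN
deduped = s.drop_duplicates()
print(len(deduped))      # 2

# unique likewise collapses the NaNs to one entry
print(len(s.unique()))   # 2
```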

In contrast, the behaviour of df.merge is the opposite of what one would usually expect. Looking at this thread and the similar ones that @mroeschke closed, I feel like there is universal agreement that the default should be to not match NaN with NaN in the case of a merge, or at least display a warning.

Solution proposal

A solution proposal would be to add an additional parameter to merge: match_na. This is how R's dplyr handles it. This way, the user can explicitly state what they want. To ensure backwards compatibility, in a first step we could leave the default as True, but display a warning if the parameter is not explicitly set and NA values match.

At a later point, we could discuss changing the default behaviour; however, I think as long as a warning is displayed and there is an easy option to switch behaviour, this might not even be necessary.

Note: As a comparison, based on this thread, Matlab seems to behave like SQL for joins.
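The proposal could be sketched as a small wrapper (match_na is not a real pandas parameter, and merge_match_na is a hypothetical helper name; here match_na=False is emulated by pre-filtering NA keys):

```python
import pandas as pd


def merge_match_na(left, right, *, left_on, right_on, how="inner",
                   match_na=True, **kwargs):
    """Hypothetical wrapper emulating the proposed match_na parameter.

    With match_na=False, rows whose key is NA/NaN are dropped before
    merging, giving SQL-style semantics where NA never equals NA.
    """
    if not match_na:
        left = left.dropna(subset=[left_on])
        right = right.dropna(subset=[right_on])
    return pd.merge(left, right, how=how,
                    left_on=left_on, right_on=right_on, **kwargs)


left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

# Only the unambiguous matches (1, 1) and (2, 2) survive
result = merge_match_na(left, right, left_on="a", right_on="b",
                        match_na=False)
```

Note that this sketch only approximates the proposal for inner joins; for left, right, or outer joins, the NA-keyed rows would need to be re-appended afterwards with unmatched counterpart columns rather than dropped outright.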

phofl commented 2 years ago

I am not opposed to changing it. I looked into this a few months back but got stuck on all the cases to consider, because merge uses some of the functions mentioned above quite heavily. If we diverge there, this gets pretty difficult real quick.

So the discussion about this may be easy, but the implementation is not as straightforward without adding a lot of complexity. If you would like to try this, I think it would certainly be welcome.

TheGustafson commented 2 years ago

Has there been any additional discussion on this? I came across this behavior in my work, and I was genuinely surprised that this was the default behavior of Pandas.

The merge: match_na solution with a deprecation warning that @kasuteru proposed is a great idea.

In the interim, a warning should be put into the merge documentation. This should be a high priority...

NickCrews commented 2 years ago

Don't really have any suggestions here, but just wanted to expose this:

Currently pd.NA and np.nan are treated as the same key when merging, even though they are treated as different elsewhere (e.g. drop_duplicates()):

import numpy as np
import pandas as pd

left = pd.DataFrame({"a": ["a", "b", pd.NA]}, dtype=object)
right = pd.DataFrame({"b": ["c", np.nan, pd.NA]}, dtype=object)

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#      a     b
# 0  <NA>   NaN
# 1  <NA>  <NA>

The proposed match_na solution above would have to consider this case.

fg-pt commented 1 year ago

just checking what's the latest update on this? was it addressed?

phofl commented 1 year ago

The issue is still open

tcquinn commented 6 months ago

Not sure if this belongs here, but under certain conditions, NaN also matches on non-NaN values. Here is a small example (Pandas 2.1.4):

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'idx_a':['A', 'A', 'B'],
    'idx_b':['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x':[10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b':['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y':[100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = a.join(
    b,
    how='inner',
    on=['idx_b', 'idx_c']
)

print(a)
                    x
idx_a idx_b idx_c    
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

print(b)
                y
idx_b   idx_c     
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

print(c)
                    x    y
idx_a idx_b idx_c         
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500

Mentioning here because other issues describing this behavior have been merged here (e.g., #25138), but all of the discussion above seems to refer only to NaN matching on NaN.