pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: groupby doesn't distinguish between different kinds of null values #48476

Open rhshadrach opened 1 year ago

rhshadrach commented 1 year ago
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]})
gb = df.groupby('a', dropna=False)
print(gb.sum())

#      b
# a     
# NaN  6

The three types of null values currently get combined into a single group. There are various places in pandas where different kinds of null values are treated as identical, e.g. pd.Series([np.nan, None]) converts None to np.nan. However, within groupby I think that if the input contains distinct values, they should remain distinct in the groupby result.
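
For illustration, a minimal sketch of the construction-time coercion versus what object dtype preserves (assuming a recent pandas; outputs in the comments are what I expect):

import numpy as np
import pandas as pd

# Default inference coerces both sentinels to np.nan in a float64 Series.
print(pd.Series([np.nan, None]).to_numpy())
# expected: [nan nan]

# With an explicit object dtype, the distinct sentinels are kept as-is,
# which is the situation the groupby example above starts from.
print(pd.Series([np.nan, None], dtype=object).to_numpy())
# expected: [nan None]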

To change this, the fix will need to be made in factorize, which could impact parts of pandas outside of groupby. Our tests didn't catch any such instances (see #48477 for an implementation), but I plan to look into this further.
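
To make the factorize connection concrete, a rough sketch of what I believe the current behaviour is (assuming pandas >= 1.5, where the use_na_sentinel keyword is available):

import numpy as np
import pandas as pd

values = np.array([np.nan, pd.NA, None], dtype=object)

# Roughly the path groupby(..., dropna=False) takes: factorize without
# dropping nulls. Today all three sentinels collapse into a single code.
codes, uniques = pd.factorize(values, use_na_sentinel=False)
print(codes)    # expected: [0 0 0]
print(uniques)  # expected: [nan]  -- a single null "group"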

Assuming we consider the current output undesirable, we need to decide whether to call this a bug or put it through deprecation. The current behaviour is tested for in test_groupby_dropna_multi_index_dataframe_nan_in_two_groups, which might suggest deprecating; however, that test was added in the original PR that introduced dropna to groupby (#30584), and I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.

Slightly related: https://github.com/pandas-dev/pandas/issues/32265

cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.

phofl commented 1 year ago

I had a similar discussion with brock yesterday: in masked arrays, None, np.nan and pd.NA are all treated equally. So if you decide to make changes, you have to take this into account and keep the behaviour there as it is. For NumPy dtypes this might be a different story.
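
A quick sketch of that masked-array behaviour (assuming a recent pandas with the nullable Int64 dtype; output shown as I expect it):

import numpy as np
import pandas as pd

# All three sentinels are coerced to pd.NA when constructing a masked array,
# so there is only ever one kind of null inside it.
arr = pd.array([np.nan, pd.NA, None], dtype="Int64")
print(arr)
# expected:
# <IntegerArray>
# [<NA>, <NA>, <NA>]
# Length: 3, dtype: Int64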

jreback commented 1 year ago

This would be a massive change; I suspect it will cause quite some issues and will break user code.

We need a proposal on how to handle this generally, rather than local patches.

jbrockmendel commented 1 year ago

In case anyone is unaware: is_valid_na_for_dtype and is_matching_na may be relevant

#32265 will be a PITA until the end of time.
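
For anyone who hasn't used them, a small sketch of those internal helpers (import paths and outputs as I understand them on recent pandas; these are private APIs, so treat the details as assumptions):

import numpy as np
import pandas as pd
from pandas._libs.missing import is_matching_na
from pandas.core.dtypes.missing import is_valid_na_for_dtype

# is_matching_na: are two NA scalars the same *kind* of NA?
print(is_matching_na(np.nan, float("nan")))                  # expected: True
print(is_matching_na(np.nan, None))                          # expected: False
print(is_matching_na(np.nan, None, nan_matches_none=True))   # expected: True
print(is_matching_na(pd.NA, np.nan))                         # expected: False

# is_valid_na_for_dtype: is this NA scalar an acceptable missing-value
# marker for an array of the given dtype?
print(is_valid_na_for_dtype(np.nan, np.dtype("float64")))    # expected: True
print(is_valid_na_for_dtype(pd.NaT, np.dtype("float64")))    # expected: False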

mroeschke commented 1 year ago

Generally agreed with the others' sentiment: if different nulls are distinct in groupby, they should probably be distinct in other ops as well.

FWIW, I am generally +1 on making NA distinct from np.nan (pro https://github.com/pandas-dev/pandas/issues/32265) and distinct from None.

jorisvandenbossche commented 1 year ago

Just to be clear about the scope of this discussion: I think this is limited to object dtype, right? It's only with object dtype that you can have multiple missing-value sentinels. (Leaving aside for a moment the discussion about distinguishing NaN from NA in nullable float dtypes; and even if we distinguish them there, NaN would probably not be considered a null value.)

I agree that we should see this in a more general context, and not just limited to groupby (also things like count, unique, value_counts, duplicated, etc. in a non-groupby context).
But, testing some behaviours, we actually did change the behaviour related to this fairly recently, in 1.4, for value_counts:

With pandas 1.3:

In [1]: s = pd.Series([np.nan, None], dtype=object)

In [2]: s.value_counts(dropna=False)
Out[2]: 
NaN    2
dtype: int64

With main:

In [1]: s = pd.Series([np.nan, None], dtype=object)

In [2]: s.value_counts(dropna=False)
Out[2]: 
NaN    1
NaN    1
dtype: int64

See https://pandas.pydata.org/docs/dev/whatsnew/v1.4.0.html#null-values-are-no-longer-coerced-to-nan-value-in-value-counts-and-mode

So if we do it there, it might also make sense to follow that in groupby?
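
To see the current inconsistency side by side, a small sketch on the same object-dtype Series (the outputs described in the comments are what I expect on main):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, None], dtype=object)

# value_counts (since 1.4) keeps np.nan and None as separate entries...
print(s.value_counts(dropna=False))        # expected: two rows, each with count 1

# ...while grouping by the same values still collapses them into one group.
print(s.groupby(s, dropna=False).size())   # expected: a single null group of size 2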

jorisvandenbossche commented 1 year ago

> I had a similar discussion with brock yesterday: in masked arrays, None, np.nan and pd.NA are all treated equally. So if you decide to make changes, you have to take this into account and keep the behaviour there as it is. For NumPy dtypes this might be a different story.

I think we should distinguish two different aspects here: 1) how different null sentinels are treated when passed as input to a method or operation (i.e. how they are coerced), and 2) how we treat different null sentinels that are already present in a typed DataFrame. I think this issue is mostly about the second aspect: the original example in the top post is the second case, a DataFrame holding different sentinels. The masked-array context brought up in this quote is more about the first case, coercion of input, since once you have a masked array you can in theory never have multiple null sentinels (again, leaving aside the NaN/NA discussion in https://github.com/pandas-dev/pandas/issues/32265).

For example, using the example from the top post but forcing a nullable dtype:

In [3]: df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]}, dtype="Int64")

In [4]: df
Out[4]: 
      a  b
0  <NA>  1
1  <NA>  2
2  <NA>  3

In [5]: df.groupby('a', dropna=False).sum()
Out[5]: 
      b
a      
<NA>  6

The question here is whether the first step (the construction) gives the behaviour we want, i.e. that NaN, NA and None are all coerced to NA. But once you have that DataFrame, the actual groupby step is of course correct in putting all NAs in a single group.