rhshadrach opened this issue 1 year ago (status: Open)
I had a similar discussion with Brock yesterday: in masked arrays, None, np.nan and pd.NA are all treated equally. So if you decide to make changes, you have to consider this and keep the behaviour there as-is. For NumPy dtypes this might be a different story.
This would be a massive change, and I suspect it would cause quite some issues and break user code.

We need a proposal on how to handle this generally, rather than local patches.
In case anyone is unaware: `is_valid_na_for_dtype` and `is_matching_na` may be relevant here.
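For reference, `is_matching_na` is an internal helper (in `pandas._libs.missing`, not public API, so the import path is an assumption that may change across versions) that only matches a null against the *same* kind of null:

```python
import numpy as np
import pandas as pd

# Internal helper; import path taken from pandas' own test suite.
from pandas._libs.missing import is_matching_na

# Same kind of null matches itself...
same = is_matching_na(np.nan, np.nan)

# ...but different null sentinels do not match each other.
nan_vs_none = is_matching_na(np.nan, None)
nan_vs_na = is_matching_na(np.nan, pd.NA)
```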
I generally agree with the others' sentiment: if different nulls are distinct in groupby, they should probably be distinct in other ops as well.
FWIW, I am generally +1 to making `NA` distinct from `np.nan` (cf. https://github.com/pandas-dev/pandas/issues/32265) and distinct from `None`.
Just to be clear about the scope of this discussion: I think this is limited to object dtype, right? It's only with object dtype that you can have multiple missing sentinels. (Leaving aside for a moment the discussion about distinguishing NaN from NA in nullable float; and even if we distinguish them, NaN would probably not be considered a null value.)
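To make that concrete, a small sketch: object dtype stores the sentinels as Python objects, so all three survive distinctly, while a typed dtype such as float64 coerces None on construction:

```python
import numpy as np
import pandas as pd

# object dtype keeps each sentinel exactly as it was passed in
s_obj = pd.Series([np.nan, None, pd.NA], dtype=object)

# float64 has a single missing representation, so None is coerced to nan
s_float = pd.Series([np.nan, None], dtype="float64")
```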
I agree that we should see this in a more general context, not just limited to groupby (this also affects things like count, unique, value_counts, duplicated, etc. in a non-groupby context).
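As things stand today, those ops treat all three sentinels uniformly as missing; a quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, None, pd.NA, 1], dtype=object)

# isna recognises every sentinel as missing, and count excludes them all
mask = s.isna()
n_valid = s.count()
```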
But, testing some behaviours, we actually did change the behaviour related to this fairly recently, in 1.4, in `value_counts`:
With pandas 1.3:

```python
In [1]: s = pd.Series([np.nan, None], dtype=object)

In [2]: s.value_counts(dropna=False)
Out[2]:
NaN    2
dtype: int64
```
With main:

```python
In [1]: s = pd.Series([np.nan, None], dtype=object)

In [2]: s.value_counts(dropna=False)
Out[2]:
NaN    1
NaN    1
dtype: int64
```
So if we do it there, it might also make sense to follow that in groupby?
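A quick sketch of the groupby side (a hypothetical frame; depending on the pandas version, nan and None end up in one null group or two, but every group key is null and the total is the same either way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": pd.Series([np.nan, None], dtype=object), "b": [1, 2]})

# dropna=False keeps the null keys; how many groups they form is the
# behaviour under discussion, so we do not assert on it here.
result = df.groupby("a", dropna=False).sum()
```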
> I had a similar discussion with Brock yesterday: in masked arrays, None, np.nan and pd.NA are all treated equally. So if you decide to make changes, you have to consider this and keep the behaviour there as-is. For NumPy dtypes this might be a different story.
I think we should distinguish two different aspects here: 1) how different null sentinels are treated as input to a method or operation (i.e. how they are coerced), and 2) how we treat different null sentinels that are already present in a typed DataFrame. I think this issue is mostly about the second aspect: the original example in the top post is the second case, a DataFrame holding different sentinels. The masked-array context brought up in this quote is more about coercion of input (the first case), since once you have a masked array, in theory you can never have multiple null sentinels (again, leaving aside the NaN/NA discussion, https://github.com/pandas-dev/pandas/issues/32265).
For example, using the example from the top post but forcing a nullable dtype:
```python
In [3]: df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]}, dtype="Int64")

In [4]: df
Out[4]:
      a  b
0  <NA>  1
1  <NA>  2
2  <NA>  3

In [5]: df.groupby('a', dropna=False).sum()
Out[5]:
         b
a
<NA>     6
```
The question here is whether the construction step (`In [3]`) is the behaviour we want, i.e. the fact that all of NaN, NA and None are coerced into NA. But once you have that DataFrame, the actual groupby step is of course correct to put all NAs in a single group.
The three types of null values currently get combined into a single group. There are various places in pandas where different types of null values are treated as interchangeable, e.g. `pd.Series([np.nan, None])` converts `None` to `np.nan`. However, within groupby, I think if the input contains distinct null values, they should remain distinct in the groupby result.

In order to change this, the change will need to be made in `factorize`, which could impact other parts of pandas outside of groupby. Our tests didn't catch any such instances (see #48477 for an implementation), but I plan to look into this further.

Assuming we consider the current output undesirable, we need to decide whether to call this a bug or take it through deprecation. The behaviour is tested in `test_groupby_dropna_multi_index_dataframe_nan_in_two_groups`, which might suggest deprecating; however, that test was added in the original PR that introduced `dropna` in groupby (#30584), and I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.

Slightly related: https://github.com/pandas-dev/pandas/issues/32265
cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.