
DataFrameGroupBy.agg with nan results into inf #59106

Open · glaucouri opened this issue 4 days ago

glaucouri commented 4 days ago

Pandas version checks

Reproducible Example

import pandas as pd

pd.__version__
# '2.2.0'

df = pd.DataFrame({'A':['A','B'], 'B':[0,0]}).astype({'A':'string','B':'Float32'})
df['C'] = df['B']/df['B']

df.groupby('A')['C'].agg(['max','min'])

# 
#    max  min
# A          
# A -inf  inf
# B -inf  inf

df.groupby('A')['C'].max() # the same with .agg(max)

# A
# A   -inf
# B   -inf
# Name: C, dtype: Float32

df.groupby('A')['C'].min()  # the same with .agg(min)

# A
# A    inf
# B    inf
# Name: C, dtype: Float32

Issue Description

DataFrameGroupBy.agg handles NaN poorly.

Unfortunately, it sometimes happens that nullable fields contain NaN values; cf. https://github.com/pandas-dev/pandas/issues/32265.

This case leads to unexpected behavior in conjunction with groupby.

In a nutshell:

Having NaN in a nullable Float column makes the groupby() min/max computation wrong; see the contrast sketched below.
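
For contrast, a minimal sketch on the frame built above (assuming pandas 2.2.0): the plain column-wide reduction does not invent infinities, only the grouped path does.

df['C'].max()               # NaN/<NA>: the all-NaN column reduces to a missing value, never +/-inf
df.groupby('A')['C'].max()  # -inf for every group, as shown in the example above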

Expected Behavior

From my perspective, a NaN must propagate: an aggregation of NaN must again produce NaN.

Semantically: "an invalid value cannot be computed with, so any transformation of it should again result in an invalid value."

An aggregation (via groupby) over NaN should therefore result in NaN; the output I would expect is sketched below.
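
As an illustration only, this is the output the report argues for on the example above, not what pandas 2.2.0 currently produces; whether the missing result should display as NaN or <NA> is part of the open discussion.

df.groupby('A')['C'].agg(['max', 'min'])
#
#    max  min
# A
# A  NaN  NaN
# B  NaN  NaN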

Installed Versions

INSTALLED VERSIONS
------------------
commit : fd3f57170aa1af588ba877e8e28c158a20a4886d
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-186-generic
Version : #206-Ubuntu SMP Fri Apr 26 12:31:10 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.1
Cython : 0.29.37
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.6.0
gcsfs : 2024.6.0
matplotlib : 3.7.5
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
avm19 commented 4 days ago

I believe the code you are complaining about is in these lines (but it's worth double-checking):
https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/_numba/kernels/min_max_.py#L106
https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/_numba/kernels/min_max_.py#L49-L52
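
For intuition, here is a deliberately simplified, hypothetical grouped-max loop (not the actual pandas/numba kernel): the per-group accumulator is seeded with `-inf`, and because every comparison against NaN is False, an all-NaN group never overwrites its seed, so `-inf` leaks into the result (symmetrically, `+inf` for min).

```python
import numpy as np

def naive_grouped_max(values: np.ndarray, labels: np.ndarray, ngroups: int) -> np.ndarray:
    # Seed each group's accumulator with -inf, the identity element for max.
    out = np.full(ngroups, -np.inf)
    for val, lab in zip(values, labels):
        # NaN > -inf is False, so NaN values are silently skipped.
        if val > out[lab]:
            out[lab] = val
    # Groups that contained only NaN still hold the -inf seed.
    return out

naive_grouped_max(np.array([np.nan, np.nan]), np.array([0, 1]), 2)
# array([-inf, -inf])  -- mirrors the -inf reported in this issue
```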


Let me explain how I think your issue report could be modified and why.

Scratch that:

> From my perspective, a nan must generate other nan, an aggregation of nan, must again generate nan
>
> semantically: "An invalid value, cannot be computed, so a transformation of it should result again into an invalid value"
> an aggregation (via groupby) of nan, should result into nan

We need to be precise in the quantification **all vs any** when dealing with aggregation or reduction.

NumPy follows "any nan implies nan":

```python
np.array([0, 1, 2]).max()                                    # 2 (no nan => no nan)
np.array([0, 1, np.inf]).max()                               # inf (no nan => no nan)
np.array([np.nan, 0, 1, np.inf]).max()                       # nan (some nan => nan)
(np.array([0, 1, np.inf]) / np.array([0, 1, np.inf])).max()  # nan (some nan => nan)
```

Pandas follows "all NA implies NA":

```python
pd.Series([0, 1, 2], dtype='Float64').max()                  # 2 (no NA => no NA)
pd.Series([0, 1, np.inf], dtype='Float64').max()             # inf (no NA => no NA)
pd.Series([np.nan, 0, 1, np.inf], dtype='Float64').max()     # inf (some NA =/=> NA)
pd.Series([np.nan, np.nan, np.nan], dtype='Float64').max()   # <NA> (all NA => NA)
```

To complicate the matter, Pandas treats `NA` and `np.nan` sometimes differently and sometimes not. It is still being decided by seniors in #32265 (which you referenced) what exactly the semantics of `NA` and `np.nan` should be in Pandas. The consensus tends to be that `NA` is a _missing_ value, while `np.nan` is a _bad_ value. In most cases, missing values can simply be ignored, unlike bad values. This explains why a single bad value `np.nan` ruins the computation in NumPy, while a single missing value `pd.NA` does not do the same in Pandas.

Now, to complicate the matter even further, Pandas transforms `np.nan` into `pd.NA`:

```python
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.max()        # 1.0, because max([<NA>, 0, 1]) is 1
(s / s).max()  # <NA>, because max([<NA>, np.nan, 1]) is np.nan, which becomes <NA>
```

- In the first line, Pandas transforms `np.nan` (which historically denoted a missing value in Pandas, before nullable arrays were introduced) into `<NA>`. So the Series is `[<NA>, 0, 1]`.
- In the second line, as expected and as we saw before, a missing value is simply ignored: `max([<NA>, 0, 1])` gives `1`.
- In the third line, `s / s` becomes `[<NA>, np.nan, 1]`, where `0 / 0`, i.e. `np.nan`, is a _bad_ value, which _must_ derail the aggregation, so `max([<NA>, np.nan, 1])` gives `np.nan`. But this is not the end: for some reason, Pandas converts `np.nan` again to `<NA>`.

~~The expected behaviour you propose, @glaucouri, would equate `pd.NA` to `np.nan`. I don't think the council of maintainers would support this. Therefore I suggest to reframe your issue differently:~~

I misunderstood your suggestion initially. You indeed insist on treating np.nan as an invalid value consistently in aggregation functions. I personally care more about consistency, so here is another example of the supposed bug:

s = pd.Series([np.nan, 0, 1], dtype="Float64")
(s / s).max()  # <NA>
(s / s).groupby([9, 9, 9]).max().iat[0]  # 1.0

The last two lines were expected to give the same result (whatever it should be).
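
One way to see why these two code paths can disagree is to look at the nullable array's buffers (using the private `_data`/`_mask` attributes, so this is an inspection sketch only, not public API): the `0 / 0` result is stored as a raw IEEE NaN whose mask bit is not set, i.e. it is not a `pd.NA`, and each reduction path then decides on its own how to treat that unmasked NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0, 1], dtype="Float64")
arr = (s / s).array    # the underlying nullable FloatingArray

# Private internals, shown only for illustration:
print(arr._mask)       # [ True False False] -> only the first element is a real <NA>
print(arr._data[1])    # nan -> the 0/0 result is an unmasked IEEE NaN, not <NA>
```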