pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: skipna=True operations don't skip NaN in FloatingArrays #59965

Open carlocastoldi opened 1 month ago

carlocastoldi commented 1 month ago

Reproducible Example

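import pandas as pd

s1 = pd.Series({"a": 0.0, "b": 1, "c": 1, "d": 0})
s2 = pd.Series({"a": 0.0, "b": 2, "c": 2, "d": 2})
# NumPy float64 division: 0.0 / 0.0 at "a" gives NaN
s3 = s1/s2
#display(s3)
# Same division on nullable dtypes: the result is a Float64 (FloatingArray) Series
s4 = s1.convert_dtypes()/s2.convert_dtypes()
#display(s4)
# Float64 Series holding a missing value (<NA>) rather than a computed NaN
s5 = pd.Series([None,0.5,0.5,0]).convert_dtypes()
#display(s5)
s3.mean(skipna=True), s4.mean(skipna=True), s5.mean(skipna=True)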

Issue Description

Following #59961, I understand that Series/DataFrames of FloatingArrays containing np.nan values are possible and meant to exist. These very same Series/DataFrames, however, fail to skip NaN values when asked to. The above example outputs:

(np.float64(0.3333333333333333), <NA>, np.float64(0.3333333333333333))

Expected Behavior

>>> s4.mean(skipna=True)
np.float64(0.3333333333333333)

Installed Versions

INSTALLED VERSIONS
------------------
commit : 139def2145b83d40364235c6297e1833eab7bb05
python : 3.12.3
python-bits : 64
OS : Linux
OS-release : 6.8.0-41-generic
Version : #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 3.0.0.dev0+1545.g139def2145
numpy : 2.2.0.dev0+git20240930.3ee9e6a
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

rhshadrach commented 1 month ago

Thanks for the report! It looks like we should be overriding the _reduce method in FloatingArray to properly handle this case. Further investigations and PRs to fix are welcome!

cooolheater commented 1 month ago

From what I checked, the issue occurs only in the isinstance(delegate, ExtensionArray) case of Series._reduce. Although skipna is passed to _reductions, it is not applied to the mask.

So, to apply skipna=True, 'isna' needs to be applied to 'mask' as well. I have opened a PR that does this.
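
To illustrate the idea (not the actual pandas internals), here is a rough NumPy-only sketch of a reduction over a (values, mask) pair. The helper name masked_mean and its nan_is_missing flag are hypothetical and exist only for this example; np.nan stands in for pd.NA in the return value.

```python
import numpy as np


def masked_mean(values, mask, skipna=True, nan_is_missing=True):
    """Mean over a (values, mask) pair; mask[i] == True marks a missing slot.

    Hypothetical helper for illustration only, not pandas code.
    """
    values = np.asarray(values, dtype="float64")
    mask = np.asarray(mask, dtype=bool)
    if nan_is_missing:
        # The proposed idea: fold NaN positions into the mask before reducing.
        mask = mask | np.isnan(values)
    if not skipna and mask.any():
        return np.nan  # stand-in for pd.NA
    kept = values[~mask]
    return kept.mean() if kept.size else np.nan


# Same data as s4 in the report: a NaN sits in the values, but nothing is masked.
values = np.array([np.nan, 0.5, 0.5, 0.0])
mask = np.zeros(4, dtype=bool)

mask_only = masked_mean(values, mask, nan_is_missing=False)  # NaN leaks into the mean
mask_plus_isna = masked_mean(values, mask)                   # NaN is skipped
print(mask_only, mask_plus_isna)
```

With the mask alone, the NaN propagates into the result; once the NaN positions are merged into the mask, the mean is taken over the three remaining values.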

rhshadrach commented 4 weeks ago

When I first triaged this issue, I was not aware of #53887. Since pd.isna does not pick up on the NaN values, I am wondering if skipna=True should skip them. Do we think this should wait for #58988?
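
For anyone who wants to check the distinction on their own install, a small diagnostic sketch follows. It compares what pandas reports as missing with what the underlying float data holds. The output is deliberately not listed here, since how a NaN inside a Float64 array is treated may differ between pandas versions. The variable names are only for this example.

```python
import numpy as np
import pandas as pd

s1 = pd.Series({"a": 0.0, "b": 1, "c": 1, "d": 0})
s2 = pd.Series({"a": 0.0, "b": 2, "c": 2, "d": 2})
s4 = s1.convert_dtypes() / s2.convert_dtypes()

# What pandas itself reports as missing.
na_by_pandas = s4.isna().to_numpy()

# The float data with any masked slots filled by NaN, checked with NumPy.
raw = s4.to_numpy(dtype="float64", na_value=np.nan)
nan_in_data = np.isnan(raw)

print(pd.DataFrame({"pd.isna": na_by_pandas, "np.isnan": nan_in_data}, index=s4.index))

# Workaround while the semantics are undecided: drop NaN by hand before reducing.
manual_mean = raw[~nan_in_data].mean()
print(manual_mean)
```

If the two columns disagree at "a", the NaN is sitting in the data without being masked, which is the situation this issue is about.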

cc @jorisvandenbossche