pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks

BUG: groupby().any() returns True instead of False for groups where timedelta column is all null #59712

Closed sfc-gh-mvashishtha closed 1 month ago

sfc-gh-mvashishtha commented 2 months ago

Pandas version checks

Reproducible Example

import pandas as pd

pd.DataFrame([pd.Timedelta(1), pd.NaT]).groupby([0, 1]).any()

Issue Description

For other dtypes, like integers and strings, groupby().any() returns False for groups where all the values are null, e.g.

pd.DataFrame([1, None]).groupby([0, 1]).any()

pd.DataFrame(["a", None]).groupby([0, 1]).any()

Expected Behavior

groupby().any() should return False for groups where all the timedelta values are null.
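A minimal sketch of the comparison described above, using a named key column rather than the positional keys from the report. The column names `key` and `value` are illustrative choices, not part of the original example. The float case should give False for the all-null group; the timedelta result depends on the installed pandas version (the report observed True on 2.2.2).

```python
import pandas as pd

# Float column: group 1 contains only a missing value.
floats = pd.DataFrame({"key": [0, 1], "value": [1.0, None]})
float_result = floats.groupby("key")["value"].any()

# Timedelta column: group 1 contains only NaT.
deltas = pd.DataFrame({"key": [0, 1], "value": [pd.Timedelta(1), pd.NaT]})
td_result = deltas.groupby("key")["value"].any()

print(float_result)  # group 1 is False: missing values are skipped
print(td_result)     # group 1 was reported as True on pandas 2.2.2
```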

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.9.18.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.2
numpy                 : 1.26.3
pytz                  : 2023.3.post1
dateutil              : 2.8.2
setuptools            : 68.2.2
pip                   : 23.3.1
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : 8.18.1
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None

rhshadrach commented 2 months ago

Thanks for the report! Confirmed on main. It appears to me the issue is that we view the values as integers before computing the mask of which values are NA.

https://github.com/pandas-dev/pandas/blob/80b685027108245086b78dbd9a176b096c92570a/pandas/core/groupby/ops.py#L375-L384

I believe switching the order of these will resolve the bug, PRs to fix are welcome!
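To illustrate why the ordering might matter, here is a rough NumPy-only sketch. It is not the code in `ops.py`; the variable names are made up for illustration. It shows that a NaT viewed as an integer is a nonzero value, so a truthiness check on the integer view cannot tell it apart from real data, whereas a mask taken from the original timedelta values can.

```python
import numpy as np

# One real timedelta and one missing value (NaT).
values = np.array([np.timedelta64(1, "ns"), np.timedelta64("NaT", "ns")])

# Viewing as int64 turns NaT into the minimum int64, a nonzero number.
as_int = values.view("i8")
naive = as_int.astype(bool)  # truthiness with no knowledge of NaT

# Taking the mask from the original values keeps the NaT information.
mask = np.isnat(values)
masked = as_int.astype(bool) & ~mask  # missing entries no longer count

print(as_int)
print(naive)
print(masked)
```

Integers have no missing-value sentinel, so an NA check run after the integer view would presumably report nothing as missing, which matches the behaviour described in the report.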

rhshadrach commented 2 months ago

Hopefully this is an easy fix (and will need a test!), so marking as a good first issue.

vivrdprasanna commented 2 months ago

take

40gilad commented 1 month ago

Took it!

vivrdprasanna commented 1 month ago

Hey @40gilad, as a first-time contributor, I'd hoped to take a stab at this issue. I see that you took it upon yourself to submit a proposed fix. No worries at all. I'll move on to another issue, but just wanted to flag this for future reference.

Rahul20037237 commented 1 month ago

take

Petroncini commented 1 month ago

take

prafulmaka commented 1 month ago

What's still pending here?

mpvaldez commented 1 month ago

Hi! Can I take this? I want to start collaborating and I found the solution