BUG: groupby().any() returns True instead of False for groups where timedelta column is all null

sfc-gh-mvashishtha commented 2 months ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.DataFrame([pd.Timedelta(1), pd.NaT]).groupby([0, 1]).any()

Issue Description

For other dtypes, like integers and strings, groupby().any() returns False for groups where all the values are null, e.g.

pd.DataFrame([1, None]).groupby([0, 1]).any()

pd.DataFrame(["a", None]).groupby([0, 1]).any()

Expected Behavior

groupby().any() should return False for groups where all the timedelta values are null.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

rhshadrach commented 2 months ago

Thanks for the report! Confirmed on main. It appears to me that the issue is that we view the values as integers before computing the mask of whether the values are NA.

https://github.com/pandas-dev/pandas/blob/80b685027108245086b78dbd9a176b096c92570a/pandas/core/groupby/ops.py#L375-L384

I believe switching the order of these will resolve the bug, PRs to fix are welcome!
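To illustrate why the order could matter, here is a minimal NumPy-only sketch. It does not reproduce the pandas internals in ops.py; `na_mask` and `any_skipna` are hypothetical helpers written for this example. It relies on the assumption, which holds for NumPy's timedelta64, that NaT is stored as the minimum int64 value, so once the values are viewed as integers the missing entry looks like an ordinary nonzero number.

```python
import numpy as np

INAT = np.iinfo(np.int64).min  # integer sentinel that stores NaT


def na_mask(arr):
    """Hypothetical helper: compute an NA mask based on the array's dtype."""
    if arr.dtype.kind in "mM":
        return np.isnat(arr)
    # Plain integers cannot represent NA, so nothing is masked.
    return np.zeros(arr.shape, dtype=bool)


def any_skipna(int_values, mask):
    """Hypothetical helper: True if any unmasked value is nonzero."""
    return bool((int_values != 0)[~mask].any())


# A single group whose only timedelta value is null.
group = np.array(["NaT"], dtype="timedelta64[ns]")

# Order described above: view as int64 first, then compute the mask.
as_int = group.view("i8")
buggy = any_skipna(as_int, na_mask(as_int))

# Swapped order: compute the mask on the timedelta values, then view.
mask = na_mask(group)
fixed = any_skipna(group.view("i8"), mask)

print(as_int[0] == INAT, buggy, fixed)
```

In this sketch the first ordering treats the NaT sentinel as a truthy integer, while the second masks it out before the reduction. Whether the real fix is exactly this reorder is for the eventual PR to confirm.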

rhshadrach commented 2 months ago

Hopefully this is an easy fix (and will need a test!), so marking as a good first issue.

vivrdprasanna commented 2 months ago

take

40gilad commented 1 month ago

took it !

vivrdprasanna commented 1 month ago

Hey @40gilad, as a first-time contributor, I'd hoped to take a stab at this issue. I see that you took it upon yourself to submit a proposed fix. No worries at all - I'll move on to another issue, but just wanted to flag this for future reference.