pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `DataFrameGroupBy.agg` has different behavior when input function receives keyword argument #60379

Closed mldcastro closed 12 hours ago

mldcastro commented 1 day ago

Reproducible Example

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [4, 5]}, index=pd.Index(list("AB"), name="group"))

def _agg_func(*args, **kwargs):
    for arg in args:
        print("ARG:\n")
        print(arg)
        print()
    return None

# No keyword arguments: _agg_func receives column "a" as a pandas Series,
# then column "b", for each group
df.groupby(level="group").agg(_agg_func)

# With a keyword argument: _agg_func receives a DataFrame with columns
# "a" and "b" for each group
df.groupby(level="group").agg(_agg_func, some_kwarg="abc")

Issue Description

When performing an aggregation over groups, my colleagues and I observed that DataFrameGroupBy.agg behaves inconsistently depending on whether keyword arguments are passed to agg along with the function.

If agg receives the function _agg_func without keyword arguments, then the input to _agg_func will be a pandas.Series. If keyword arguments are passed, then the input to _agg_func is a pandas.DataFrame.
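To make the difference easier to see than with the printed output, here is a minimal sketch that records the type of object handed to the function in each case. The helper `make_recorder` and the lists `seen_plain` and `seen_kwargs` are names made up for this illustration. On pandas 2.2.3 I would expect the first set to contain only Series and the second only DataFrame, matching the description above, but the exact result may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2], "b": [4, 5]},
    index=pd.Index(list("AB"), name="group"),
)


def make_recorder(seen):
    """Build an aggregation function that logs the type of its input (hypothetical helper)."""

    def _record(obj, **kwargs):
        seen.append(type(obj).__name__)
        # Return a scalar so the call is a valid aggregation
        return len(obj)

    return _record


seen_plain = []
seen_kwargs = []
grouped = df.groupby(level="group")

# Same function body, called without and with a keyword argument
grouped.agg(make_recorder(seen_plain))
grouped.agg(make_recorder(seen_kwargs), some_kwarg="abc")

print("without kwargs:", set(seen_plain))
print("with kwargs:   ", set(seen_kwargs))
```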

Expected Behavior

As per the docs, the parameter func should be able to receive a pandas.DataFrame as input. Hence, I would expect the input to _agg_func to always be a pandas.DataFrame, regardless of whether keyword arguments are passed.
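In the meantime, a sketch of two possible ways to get a predictable input type. This is my own workaround under the assumptions stated in the comments, not an officially recommended fix. The functions `group_total` and `column_total` and the `scale` parameter are hypothetical. The first option uses apply, which hands the function one DataFrame per group. The second binds the keyword with functools.partial so that agg itself receives no keyword arguments and works column by column.

```python
from functools import partial

import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2], "b": [4, 5]},
    index=pd.Index(list("AB"), name="group"),
)
grouped = df.groupby(level="group")


def group_total(frame, scale=1):
    # Hypothetical function that expects the whole group as a DataFrame
    return int(frame.to_numpy().sum()) * scale


def column_total(series, scale=1):
    # Hypothetical function that expects a single column as a Series
    return series.sum() * scale


# Option 1: apply passes one DataFrame per group and forwards keyword arguments
per_group = grouped.apply(group_total, scale=1)

# Option 2: bind the keyword up front, so agg is called without kwargs
# and the function is applied to each column of each group
per_column = grouped.agg(partial(column_total, scale=10))

print(per_group)
print(per_column)
```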

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.11.10
python-bits           : 64
OS                    : Linux
OS-release            : 5.15.133.1-microsoft-standard-WSL2
Version               : #1 SMP Thu Oct 5 21:02:42 UTC 2023
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : C.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 2.0.1
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 23.3.1
Cython                : None
sphinx                : None
IPython               : 8.29.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.4
lxml.etree            : None
matplotlib            : 3.9.0
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : 2.9.9
pymysql               : None
pyarrow               : 18.0.0
pyreadstat            : None
pytest                : 8.3.2
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.14.1
sqlalchemy            : 2.0.32
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

rhshadrach commented 12 hours ago

Thanks for the report, closing as a duplicate of https://github.com/pandas-dev/pandas/issues/39169.