pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Adding a random function argument to DataFrameGroupBy.aggregate changes the grouping object from Series to DataFrame. #47701

Open montali opened 2 years ago

montali commented 2 years ago

Reproducible Example

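import pandas as pd

def typer_with_random_arg(x, random_arg):
    # Print the type of the object that agg() passes in; random_arg is unused.
    print(type(x))

def typer_no_args(x):
    # Same as above, but without the extra argument.
    print(type(x))

a = pd.DataFrame({'A': [1, 4], 'B': [5, 10]}, index=['A', 'B'])
a.groupby('A').agg(typer_with_random_arg, random_arg=True)  # reported: DataFrame
a.groupby('A').agg(typer_no_args)  # reported: Series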

Issue Description

Hi! While building a custom aggregation function to use with pandas.core.groupby.DataFrameGroupBy.aggregate, I noticed some surprising behaviour. .aggregate() accepts *args (and **kwargs, as in the example above) so that extra arguments can be forwarded to the aggregation function. Passing such an argument shouldn't change what the function receives as its first argument, but it does: when the function takes no extra arguments, it receives a pd.Series, while if you add an argument it gets a pd.DataFrame, as expected. The Series case is the unexpected one, since according to the docs func, "if a function, must either work when passed a DataFrame or when passed to DataFrame.apply".
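To make the difference easier to compare across pandas versions, here is a minimal sketch that records the type names instead of printing them. The helper make_type_recorder and the list names are hypothetical, introduced only for illustration. Unlike the reproducer above, the recorded function returns len(x) so that it is a valid scalar aggregation. What ends up in each list depends on the installed pandas version, so no output is shown.

```python
import pandas as pd


def make_type_recorder(seen):
    """Build an aggregation function that appends the type name of its input to `seen`.

    Hypothetical helper, not part of pandas.
    """
    def record(x, random_arg=None):
        seen.append(type(x).__name__)
        return len(x)  # return a scalar so the call is a valid aggregation
    return record


df = pd.DataFrame({"A": [1, 4], "B": [5, 10]}, index=["A", "B"])

# Case 1: no extra argument forwarded to the aggregation function.
seen_without_extra = []
df.groupby("A").agg(make_type_recorder(seen_without_extra))

# Case 2: an extra keyword argument forwarded through agg().
seen_with_extra = []
df.groupby("A").agg(make_type_recorder(seen_with_extra), random_arg=True)

print("no extra argument:  ", sorted(set(seen_without_extra)))
print("with extra argument:", sorted(set(seen_with_extra)))
```

If the two printed lists differ, the extra argument changed what the aggregation function receives, which is the behaviour reported here.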

Expected Behavior

The aggregating function receives a pd.DataFrame regardless of whether extra arguments are passed.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.9.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.17.11-1rodete2-amd64
Version          : #1 SMP PREEMPT Debian 5.17.11-1rodete2 (2022-06-09)
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.5
numpy            : 1.21.5
pytz             : 2022.1
dateutil         : 2.8.1
pip              : None
setuptools       : unknown
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.5
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 3.2.3
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.7.4
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : 2.6.10dev0
odfpy            : None
openpyxl         : None
pandas_gbq       : 0.12.0
pyarrow          : 6.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.1
sqlalchemy       : None
tables           : 3.6.2-dev
tabulate         : 0.8.9
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : None

montali commented 2 years ago

This may be linked to https://github.com/pandas-dev/pandas/issues/44813