BUG: DataFrame.agg produces different types if the DataFrame is empty

fluggo commented 3 years ago

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

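import pandas as pd

# Empty DataFrame: agg() returns a DataFrame (pandas 1.2.4)
tdf2 = pd.DataFrame([], columns=['lang', 'name'])
print(type(tdf2.agg({'name': lambda y: y.values})))

# One-row DataFrame: agg() returns a Series (pandas 1.2.4)
tdf2 = pd.DataFrame([['a', 'boof']], columns=['lang', 'name'])
print(type(tdf2.agg({'name': lambda y: y.values})))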

Problem description

This seems very similar to the description of #16621, and may be the same bug; check there for some history, which points to an original fix in #2476. You can also look at this StackExchange question if you want to see how I nearly pulled my hair out.

The first call of .agg() above, on the empty DataFrame, produces a DataFrame result. The second call of .agg() produces a Series instead.

This has unfortunate implications when used with groupby():

tdf2 = (
    pd.DataFrame([['a', 'boof'], ['a', 'toop']], columns=['lang', 'name'])
    .astype({'lang': pd.CategoricalDtype(['a', 'b'])})
)
print(tdf2.groupby('lang').apply(lambda x: x.agg({'name': lambda y: ', '.join(y.values)})))

...produces:

lang		0	name
a	name	boof, toop	NaN

...instead of:

lang	name
a	boof, toop
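
For anyone hitting the same thing: the stray columns presumably come from the unobserved category 'b', whose empty group goes through the DataFrame-returning path of agg() while the 'a' group returns a Series. The following is only a sketch of one way to sidestep the inconsistency, not a fix for it. It assumes that skipping unobserved categories (observed=True) is acceptable and that aggregating the 'name' column directly is what is wanted; the variable name joined is made up for illustration.

```python
import pandas as pd

tdf2 = (
    pd.DataFrame([['a', 'boof'], ['a', 'toop']], columns=['lang', 'name'])
    .astype({'lang': pd.CategoricalDtype(['a', 'b'])})
)

# Assumption: unobserved categories can be dropped, so no empty group
# is ever passed to the aggregation function.
joined = (
    tdf2.groupby('lang', observed=True)['name']
    .agg(lambda names: ', '.join(names))
)
print(joined)
```

Selecting the column before aggregating avoids the nested apply()/agg() call entirely, so the result is a single Series indexed by lang.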

Expected Output

I'm assuming that the call with data produces a Series because the agg() is single-valued; the top code should therefore probably produce two Series:

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Actual output

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Output of `pd.show_versions()`

```
INSTALLED VERSIONS
------------------
commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.8.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.7
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
```

EricFleishman26 commented 3 years ago

I am trying to help with this issue, but when I run my test cases I get a ModuleNotFoundError:

ModuleNotFoundError: No module named 'pandas._libs.interval'

Does anyone have an idea as to why this is?

nmay231 commented 3 years ago

@EricFleishman26 If you are using a clone of pandas in your development environment, it might be because you have not compiled the C extensions of pandas. See https://pandas.pydata.org/docs/development/contributing.html#creating-a-development-environment
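
If it helps to confirm that diagnosis before rebuilding, here is a small sketch that reports which modules Python cannot locate. The helper name missing_modules is made up for illustration and is not part of pandas; it only uses importlib from the standard library, and it treats any ImportError raised while resolving a parent package as "missing".

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised when a parent package is absent or fails to import,
            # which is what an unbuilt source checkout would likely do.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Calling missing_modules(['pandas._libs.interval']) from the environment that runs the tests should return an empty list once the extensions are built; a non-empty list suggests the build step from the linked guide has not been run there.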
