pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: groupby.ohlc returns Series rather than DataFrame if input is empty #42902

Closed batterseapower closed 1 year ago

batterseapower commented 3 years ago

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

# This returns a pd.Series with 0 items
pd.DataFrame.from_dict({
    'timestamp': pd.DatetimeIndex([]),
    'symbol': np.array([], dtype=object),
    'price': np.array([], dtype='f4'),
}).set_index('timestamp').groupby('symbol').resample('D', closed='right')['price'].ohlc()

# This returns a pd.DataFrame with columns open, high, low, close and a single row (index=[(ABC, 2021-01-01)])
pd.DataFrame.from_dict({
    'timestamp': pd.DatetimeIndex(['2021-01-01']),
    'symbol': np.array(['ABC'], dtype=object),
    'price': np.array([100], dtype='f4'),
}).set_index('timestamp').groupby('symbol').resample('D', closed='right')['price'].ohlc()

Problem description

It is surprising that groupby.ohlc returns different data types depending on whether or not there are zero rows.

Expected Output

I think it would be more sensible for the zero row case to return a DataFrame with columns open/high/low/close, but with zero rows. Returning a Series as a special case doesn't make sense.
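To make the expected shape concrete, here is a minimal sketch of a caller-side guard that coerces the empty-Series result into a zero-row DataFrame with the four OHLC columns. The names `OHLC_COLUMNS` and `ensure_ohlc_frame` are hypothetical and not part of pandas, and the float64 dtype for the empty columns is an assumption made for illustration.

```python
import pandas as pd

# Column names produced by ohlc() in the non-empty case.
OHLC_COLUMNS = ["open", "high", "low", "close"]


def ensure_ohlc_frame(result):
    """Hypothetical guard: turn an empty Series into an empty OHLC DataFrame.

    Anything else (including a normal OHLC DataFrame) is returned unchanged.
    """
    if isinstance(result, pd.Series) and result.empty:
        # Keep the (empty) index of the result; float64 is an assumption.
        return pd.DataFrame(index=result.index, columns=OHLC_COLUMNS, dtype="float64")
    return result


# Stand-in for what the empty-input call returns in the affected versions.
empty_result = pd.Series([], dtype="float32", name="price")
fixed = ensure_ohlc_frame(empty_result)
print(type(fixed).__name__, list(fixed.columns), len(fixed))
```

With a guard like this, downstream code can rely on the open/high/low/close columns being present whether or not there were any input rows.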

Output of pd.show_versions()

(pd.show_versions() failed with an error about llvmlite.dll in the environment with pandas 1.3.1 installed, so this output is actually from a slightly earlier install of pandas, which still exhibits the bug.)

INSTALLED VERSIONS
------------------
commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18363
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.18.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.0

rhshadrach commented 3 years ago

Thanks for the report - agreed and confirmed on master. Further investigation and a PR to fix this are most welcome.

ndtretyak commented 3 years ago

SeriesGroupBy._wrap_applied_output is used here to create the resulting object: https://github.com/pandas-dev/pandas/blob/860ff03ee1e4de8a1af1cf55e844f38d7876253a/pandas/core/groupby/groupby.py#L1304-L1306

However, if the initial DataFrame is empty, there are no group keys, and _wrap_applied_output just returns an empty Series: https://github.com/pandas-dev/pandas/blob/860ff03ee1e4de8a1af1cf55e844f38d7876253a/pandas/core/groupby/generic.py#L444-L451
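The branching described above can be illustrated with a toy model. This is a simplified sketch, not the pandas source: `toy_wrap_applied_output` and its parameters are hypothetical, and it only mimics the idea that an empty list of group keys short-circuits to an empty Series while a non-empty one combines per-group frames.

```python
import pandas as pd


def toy_wrap_applied_output(keys, values, name=None):
    """Toy stand-in for the wrapping step (hypothetical, not pandas internals)."""
    if len(keys) == 0:
        # No groups: the early return produces a Series, so the OHLC columns are lost.
        return pd.Series([], name=name, dtype="float64")
    # One per-group OHLC frame per key: stack them under their group keys.
    return pd.concat(values, keys=keys)


per_group = pd.DataFrame(
    {"open": [100.0], "high": [100.0], "low": [100.0], "close": [100.0]},
    index=pd.DatetimeIndex(["2021-01-01"]),
)

print(type(toy_wrap_applied_output([], [], name="price")).__name__)
print(type(toy_wrap_applied_output(["ABC"], [per_group])).__name__)
```

In this toy, a fix would amount to making the empty branch build a frame with the same columns as the non-empty branch, rather than returning a Series.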

jbrockmendel commented 1 year ago

Looks like we have several incorrect tests in tests.resample.test_base that test for the wrong behavior here.