pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

groupby.apply modifies the index of an empty series #21192

Open jluttine opened 6 years ago

jluttine commented 6 years ago

Code Sample, a copy-pastable example if possible

Correct behaviour for a non-empty series (the index is left unchanged):

>>> pd.Series(index=pd.DatetimeIndex(["2018-01-01"]), data=[10]).groupby([1]).apply(lambda x: x).index
DatetimeIndex(['2018-01-01'], dtype='datetime64[ns]', freq=None)

Incorrect behaviour for an empty series (the index is changed):

>>> pd.Series(index=pd.DatetimeIndex([]), data=[]).groupby([]).apply(lambda x: x).index
Float64Index([], dtype='float64')

Problem description

The index should remain unchanged.

Why does this matter at all?
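
As a concrete illustration (added here for emphasis, not part of the original report): once the index has silently become a Float64Index, downstream code that relies on datetime-like behaviour, such as resampling, breaks even though nothing in that code changed:

import pandas as pd

s = pd.Series(index=pd.DatetimeIndex([]), data=[], dtype=float)
out = s.groupby([]).apply(lambda x: x)

# On versions where the bug reproduces, `out` has a Float64Index, so this
# raises a TypeError; with the original (empty) DatetimeIndex it would simply
# return an empty result.
out.resample("D").sum()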

Expected Output

Expected behaviour for an empty series:

>>> pd.Series(index=pd.DatetimeIndex([]), data=[]).groupby([]).apply(lambda x: x).index
DatetimeIndex([], dtype='datetime64[ns]', freq=None)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.42
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: None
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: 2.5.2
xlrd: 0.9.4
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
mroeschke commented 6 years ago

Thanks for the report. Sounds reasonable, and I can replicate this on the latest release. Investigation and PRs are welcome!

rhshadrach commented 4 years ago

For an Index, I think it's relatively straightforward to solve this. However, for a MultiIndex I don't know of a direct way to create an empty MultiIndex with specified dtypes for each of the levels. The only way I can figure out is to create a non-empty DataFrame with the specified types and then subset it so that it becomes empty; e.g.

# Build a one-row frame whose MultiIndex levels have the desired dtypes...
df = pd.DataFrame(
  {
    'a': pd.DatetimeIndex(["2018-01-01"]),
    'b': pd.DatetimeIndex(["2018-01-01"]),
    'c': 1
  }
).set_index(['a', 'b'])
# ...then filter it down to an empty frame; the level dtypes are preserved.
df = df[df.c == 0]

Is there a better way, even internally?
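
One possible sketch, added here for illustration (not from the comment above, and it may not suit the internal use case): MultiIndex.from_arrays accepts empty typed arrays, and the resulting empty MultiIndex appears to keep the level dtypes.

import pandas as pd

# Build an empty MultiIndex directly from empty, typed arrays rather than
# emptying a non-empty DataFrame.
idx = pd.MultiIndex.from_arrays(
    [pd.DatetimeIndex([]), pd.DatetimeIndex([])], names=["a", "b"]
)

print(len(idx))             # 0
print(idx.levels[0].dtype)  # datetime64[ns]
print(idx.levels[1].dtype)  # datetime64[ns]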

RobbieClarken commented 3 years ago

I'm seeing something similar with an empty DataFrame in the latest pandas (v1.1.5):

>>> import pandas as pd
>>> pd.__version__
'1.1.5'
>>> df = pd.DataFrame([], columns=["A", "B"])
>>> df
Empty DataFrame
Columns: [A, B]
Index: []
>>> df.groupby("A", group_keys=False).apply(lambda g: g)
Empty DataFrame
Columns: []
Index: []

I would expect groupby.apply to preserve the columns of the empty DataFrame. I haven't checked whether #34998 fixes this.
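
A possible workaround, sketched here for illustration rather than taken from the thread: reindex the empty result back onto the original columns so downstream code still sees the expected schema.

import pandas as pd

df = pd.DataFrame([], columns=["A", "B"])
result = df.groupby("A", group_keys=False).apply(lambda g: g)

# Workaround sketch: if the columns were dropped from the empty result,
# restore the original column labels (a no-op once the bug is fixed).
if result.empty:
    result = result.reindex(columns=df.columns)

print(list(result.columns))  # ['A', 'B']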