pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

groupby.apply modifies the index of an empty series #21192

Open jluttine opened 6 years ago

jluttine commented 6 years ago

Code Sample, a copy-pastable example if possible

Correct behaviour for a non-empty series (the index is left unchanged):

>>> pd.Series(index=pd.DatetimeIndex(["2018-01-01"]), data=[10]).groupby([1]).apply(lambda x: x).index
DatetimeIndex(['2018-01-01'], dtype='datetime64[ns]', freq=None)

Incorrect behaviour for an empty series (the index is changed):

>>> pd.Series(index=pd.DatetimeIndex([]), data=[]).groupby([]).apply(lambda x: x).index
Float64Index([], dtype='float64')

Problem description

The index should remain unchanged.

Why does this matter at all?
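
As a concrete illustration (added here for emphasis, not part of the original report): once the index has silently become a Float64Index, downstream code that relies on datetime-like behaviour, such as resampling, breaks even though nothing in that code changed:

import pandas as pd

s = pd.Series(index=pd.DatetimeIndex([]), data=[], dtype=float)
out = s.groupby([]).apply(lambda x: x)

# On versions where the bug reproduces, `out` has a Float64Index, so this
# raises a TypeError; with the original (empty) DatetimeIndex it would simply
# return an empty result.
out.resample("D").sum()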

Expected Output

Expected behaviour for an empty series:

>>> pd.Series(index=pd.DatetimeIndex([]), data=[]).groupby([]).apply(lambda x: x).index
DatetimeIndex([], dtype='datetime64[ns]', freq=None)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.42
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: None
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: 2.5.2
xlrd: 0.9.4
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
mroeschke commented 6 years ago

Thanks for the report. Sounds reasonable, and I can replicate this on the latest release. Investigation and PRs are welcome!

rhshadrach commented 4 years ago

For an Index, I think it's relatively straightforward to solve this. However, for a MultiIndex I don't know of a direct way to create an empty MultiIndex with specified dtypes for each of the levels. The only way I can figure out is to create a non-empty DataFrame with the specified types and then subset it so that it becomes empty; e.g.

# Build a one-row frame whose MultiIndex levels have the desired dtypes...
df = pd.DataFrame(
  {
    'a': pd.DatetimeIndex(["2018-01-01"]),
    'b': pd.DatetimeIndex(["2018-01-01"]),
    'c': 1
  }
).set_index(['a', 'b'])
# ...then filter it down to an empty frame; the level dtypes are preserved.
df = df[df.c == 0]

Is there a better way, even internally?
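
One possible sketch, added here for illustration (not from the comment above, and it may not suit the internal use case): MultiIndex.from_arrays accepts empty typed arrays, and the resulting empty MultiIndex appears to keep the level dtypes.

import pandas as pd

# Build an empty MultiIndex directly from empty, typed arrays rather than
# emptying a non-empty DataFrame.
idx = pd.MultiIndex.from_arrays(
    [pd.DatetimeIndex([]), pd.DatetimeIndex([])], names=["a", "b"]
)

print(len(idx))             # 0
print(idx.levels[0].dtype)  # datetime64[ns]
print(idx.levels[1].dtype)  # datetime64[ns]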

RobbieClarken commented 3 years ago

I'm seeing something similar with an empty DataFrame in the latest pandas (v1.1.5):

>>> import pandas as pd
>>> pd.__version__
'1.1.5'
>>> df = pd.DataFrame([], columns=["A", "B"])
>>> df
Empty DataFrame
Columns: [A, B]
Index: []
>>> df.groupby("A", group_keys=False).apply(lambda g: g)
Empty DataFrame
Columns: []
Index: []

I would expect groupby.apply to preserve the columns of the empty DataFrame. I haven't checked whether #34998 fixes this.
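
A possible workaround, sketched here for illustration rather than taken from the thread: reindex the empty result back onto the original columns so downstream code still sees the expected schema.

import pandas as pd

df = pd.DataFrame([], columns=["A", "B"])
result = df.groupby("A", group_keys=False).apply(lambda g: g)

# Workaround sketch: if the columns were dropped from the empty result,
# restore the original column labels (a no-op once the bug is fixed).
if result.empty:
    result = result.reindex(columns=df.columns)

print(list(result.columns))  # ['A', 'B']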