pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.54k stars 17.89k forks source link

BUG: DatetimeIndexResamplerGroupby as_index interaction with .agg() #52397

Open 1112114641 opened 1 year ago

1112114641 commented 1 year ago

Pandas version checks

Reproducible Example

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").min()

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg(min)

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=True).resample("2min").agg({"b":"min"})

# does not work
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg({"b":"min"})

Issue Description

Using pandas 2.0.0 / python 3.9.13 running the above code yields different behaviour based on whether the as_index=False flag is set in the .groupby() method or not.

The resulting error message ValueError: Length of values (5) does not match length of index (30) is misleading, uninformative, and does not explain the difference in behaviour between .agg({"b":"min"}), and .agg(min). The same holds true for other standard .groupby aggregations (max, first, ...).

Could you please look into this?

Expected Behavior

.agg({"b":"min"}) and .agg(min) should yield the same result, irrespective of as_index=True/False.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 478d340667831908b5b4bf09a2787a11a14560c9 python : 3.9.13.final.0 python-bits : 64 OS : Darwin OS-release : 22.4.0 Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3[/RELEASE_ARM64_T6000](https://file+.vscode-resource.vscode-cdn.net/RELEASE_ARM64_T6000) machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : None LOCALE : None.UTF-8 pandas : 2.0.0 numpy : 1.23.5 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 63.3.0 pip : 23.0.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.9.0 pandas_datareader: None bs4 : 4.11.2 bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.6.3 numba : 0.56.4 numexpr : None odfpy : None openpyxl : 3.1.1 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.0 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
szczekulskij commented 1 year ago

Can I pick this up and get two weeks to dig into the codebase? Thanks

luke396 commented 1 year ago

@szczekulskij Just wanted to kindly remind you that commenting on 'take' will automatically assign you to this issue. No pressure though!

1112114641 commented 1 year ago

as an interesting aside, I discovered a new method of defining aggregations, and this code works also:

pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg(b=("b","min"))
szczekulskij commented 1 year ago

take

szczekulskij commented 1 year ago

To keep this thread updated - I've identified the problem is in pandas/core/groupby/generic.py. Line 421: if not self.as_index and not_indexed_same: - if this line doesn't result in True, everything works as expected

I'll now work on solving this

szczekulskij commented 1 year ago

Hey, could I ask for review on the attached PR ? It definitely solves the issue, but I'm open for feedback - I'm still learning more about the repo

rhshadrach commented 1 year ago

The last example in the OP doesn't raise on 1.5.x, marking as a regression.

rhshadrach commented 1 year ago

It's not clear to me what the expected behavior of using as_index=False should be. I'm not finding any tests with groupby(...).resample(...) and as_index=False. In the docs, we say as_index has no impact on filtrations or transformations, but a resample doesn't fall into these categories.

cc @jbrockmendel @mroeschke @topper-123

rhshadrach commented 1 year ago

A closer look at behavior, it appears to me as_index has no impact to resampler across methods. I think we stick with that for this issue, if we want to change that then we can take it up in a dedicated issue.

szczekulskij commented 1 year ago

Hey @rhshadrach, sorry for delay - updated now. Happy to create a seperate ticket to implement a different functionality dependent on whether as_index=False (if I understood correctely, this is what you've discussed in the previous point)

Thanks for feedback and going w. me through this ticket