pandas-dev / pandas


ENH: Should we support aggregating by-frame in DataFrameGroupBy.agg #58225

Open rhshadrach opened 6 months ago

rhshadrach commented 6 months ago

This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.

Summary: DataFrameGroupBy.agg will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF the entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or allow users to control whether the UDF they provide is passed a Series or DataFrame. If we do enable users to control this, we need to decide on what the right behavior is in the by-frame case.

Relevant code: https://github.com/pandas-dev/pandas/blob/b4493b6b174434a2c7f99e7bcd58eedf18f037fa/pandas/core/groupby/generic.py#L1546-L1569

Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:

  1. There is one grouping (e.g. .groupby("a") and not .groupby(["a", "b"])) and the user is passing args or kwargs through to the UDF; or
  2. There is one grouping and attempting .agg([func]) raises a ValueError with "No objects to concatenate" in the message

In all other cases, pandas operates by-column. The only place (2) is currently hit in the test-suite is aggregating an empty DataFrame. I do not see how else it might be possible to hit (2).

Impact: When a user provides a UDF that can operate either by-column or by-frame but does not necessarily produce the same result in both modes, whether they have a single grouping and/or pass an arg/kwarg through determines which mode runs, and therefore which result they get. This is #39169.
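As an illustration, here is a minimal sketch of the inconsistency; the UDF merely reports the type it receives, and the outputs are what I expect from the logic described above.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

def func(x, y=0):
    # Report what pandas passed to the UDF
    return type(x).__name__

# No args/kwargs passed through: the UDF sees each column as a Series
print(gb.agg(func))
#         b
# a
# 1  Series
# 2  Series

# One grouping plus a kwarg to pass through: the UDF sees the whole DataFrame
print(gb.agg(func, y=1))
#            b
# a
# 1  DataFrame
# 2  DataFrame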

This seems to me like a particularly bad behavior in DataFrameGroupBy.agg, and I'd like to resolve this bug.

Option A: Only support by-column in groupby(...).agg

Pros: Most straightforward. Cons: Removes the ability to use agg when users want the UDF to see multiple columns at once (though that by-frame path isn't very accessible today anyway). An example of such a UDF is sketched below.
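For concreteness, a UDF that genuinely needs multiple columns at once; under Option A its natural home would be apply rather than agg (example mine):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# This UDF combines two columns, so a by-column agg cannot express it
def weighted_sum(grp):
    return (grp["b"] * grp["c"]).sum()

print(df.groupby("a").apply(weighted_sum, include_groups=False))
# a
# 1    46
# 2    40
# dtype: int64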

Option B: Add an argument

Users would communicate what they want the UDF to accept by passing an argument.

df.groupby(...).agg(func, by="frame")

# or    

df.groupby(...).agg(func, by="column")

Pros: Enables users to have their UDF accept a frame when they desire. Cons: We need to decide on the behavior in the by="frame" case (see below).

Option C-a: Required Decorator

Users would communicate what they want the UDF to accept by adding a decorator.

@pd.agg_decorator(by="frame")
def func(...):
    ...

# or    

@pd.agg_decorator(by="column")
def func(...):
    ...

Pros: Enables users to have their UDF accept a frame when they desire. Cons: Requires decorating the UDF, which makes inline lambdas awkward.

Option C-b: Optional Decorator

This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.
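A sketch of what Option C-b could look like; pd.agg_decorator is hypothetical, as above:

import pandas as pd

def col_func(col):
    # Undecorated: always called by-column; col is a Series
    return col.sum()

@pd.agg_decorator(by="frame")
def frame_func(grp):
    # Decorated: called by-frame; grp is the group's DataFrame
    return grp["b"].sum()

# gb.agg(col_func)    -> col_func receives each column as a Series
# gb.agg(frame_func)  -> frame_func receives each group as a DataFrame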

If we do decide to support by-frame

Then we need to decide on its behavior. Currently _aggregate_frame will create a dictionary, roughly as

{group: func(group_df, *args, **kwargs) for group, group_df in self}

pass this dictionary to the DataFrame constructor, using the non-grouped columns as the index, and then take a transpose. https://github.com/pandas-dev/pandas/blob/b4493b6b174434a2c7f99e7bcd58eedf18f037fa/pandas/core/groupby/generic.py#L1614-L1621
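For intuition, the assembly step can be modeled standalone as below (a simplified model, not the actual implementation). Because the constructor's index is the non-grouped columns (here just ["b"], length 1), scalar results broadcast along it while list results must match its length, which explains the failures shown next.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

# The dict of per-group results, as in the comprehension above
results = {name: grp["b"].sum() for name, grp in gb}  # {1: 7, 2: 5}

# Constructor with index = non-grouped columns, then transpose;
# scalars broadcast along that index
out = pd.DataFrame(results, index=["b"], columns=pd.Index([1, 2], name="a")).T
print(out)
#    b
# a
# 1  7
# 2  5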

This behavior works well when the UDF returns scalars, but not when it returns iterable objects. E.g.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

def func1(x, y):
    return [1, 2, 3]
print(gb.agg(func1, y=3))
# ValueError: Length of values (3) does not match length of index (1)

def func2(x, y):
    return [1]
print(gb.agg(func2, y=3))
#    b
# a   
# 1  1
# 2  1

def func3(x, y):
    return pd.Series([1, 2, 3])
print(gb.agg(func3, y=3))
#     b
# a    
# 1 NaN
# 2 NaN

Option I: Treat all returns as if they are scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}))
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: Simple for users to understand and for us to implement. Cons: Does not allow the performance enhancement of operating on a DataFrame and returning a Series (e.g. lambda x: x.sum() where x is a DataFrame).

Option II: Add an argument.

Users would communicate how they want the UDF's return value to be interpreted via an argument.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="row")
#    x  y
# 0  5  4
# 1  5  4
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="scalar")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: Explicit user control. Cons: Has an additional argument.

Option IIIa: A Series is interpreted as a row to stack vertically; all other returns are treated as scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
#    x  y
# 0  5  4
# 1  5  4

Pros: No additional argument / features needed. Cons: Somewhat magical; users cannot have pandas treat a Series as if it were a scalar.

Option IIIb: A Series is interpreted as a row to stack vertically unless boxed with pd.Scalar; all other returns are treated as scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
#    x  y
# 0  5  4
# 1  5  4

Boxing the UDF's result with pd.Scalar tells pandas to treat it like a scalar. Note that pd.Scalar does not currently exist.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Scalar(pd.Series({"x": 5, "y": 4})), by="frame")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: No additional argument needed. Cons: Somewhat magical, need to add pd.Scalar.

cc @jorisvandenbossche @jbrockmendel @Dr-Irv

Dr-Irv commented 6 months ago

Nice write-up.

If we were to choose Option A, and if someone had a func that operated on a DataFrame, what would be their migration path? Can you illustrate via an example?

rhshadrach commented 6 months ago

If we were to choose Option A, and if someone had a func that operated on a DataFrame, what would be their migration path? Can you illustrate via an example?

During the dev meeting, I expressed doubt that this could be reliably answered. But seeing how badly _aggregate_frame (the by-frame path of agg) behaves for iterables, answering it now appears much easier than I anticipated.

In the OP, I detailed the poor behavior of _aggregate_frame when the UDF returns iterables. So I think (though I still do not feel quite certain) we only need to consider two cases: scalar returns, and Series returns that can be aligned with the columns. I say "can be aligned with the columns" because, as shown in the OP example, when the result is a Series, _aggregate_frame aligns the result's index with the DataFrame's columns.

Scalar return: _aggregate_frame gives an undesired result whereas apply gives the expected one. Series return: both give the same, expected, result.

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})
gb = df.groupby("a")

# Scalar return
def func(x, **kwargs):
    return np.sum(np.sum(x))
print(gb.agg(func, y=2))
#     b   c
# a
# 1  14  14
# 2  10  10
print(gb.apply(func, include_groups=False))
# a
# 1    14
# 2    10
# dtype: int64

# Series return that is alignable with the columns
def func(x, **kwargs):
    return x.sum() + np.sum(np.sum(x))
print(gb.agg(func, y=2))
#     b   c
# a        
# 1  21  21
# 2  15  15
print(gb.apply(func, include_groups=False))
#     b   c
# a        
# 1  21  21
# 2  15  15

rhshadrach commented 6 months ago

One thing I missed: DataFrameGroupBy.apply has special handling for Series, which makes it undesirable when users want a Series result treated as a scalar for the purposes of aggregation. I personally do not think this is a significant case.

If the index of the result is the same for every group, DataFrameGroupBy.apply stacks the results horizontally (the first example below). If the index is not the same for every group, it stacks the results vertically (the second example). When an operation behaves like the latter (the index differs across groups) but there happens to be only a single group, you get the former behavior, since the index is then trivially the same for each group. This is #54992.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})
gb = df.groupby("a")

def func(x, **kwargs):
    return pd.Series([1, 2, 3])
print(gb.apply(func, include_groups=False))
#    0  1  2
# a         
# 1  1  2  3
# 2  1  2  3

def func(x, **kwargs):
    if len(x) == 2:
        return pd.Series([1, 2, 3], index=list("xyz"))
    else:
        return pd.Series([4, 5, 6], index=list("xwz"))
print(gb.apply(func, include_groups=False))
# a   
# 1  x    1
#    y    2
#    z    3
# 2  x    4
#    w    5
#    z    6
# dtype: int64

jbrockmendel commented 6 months ago

If the index of the result is not always the same

As mentioned in the call, these checks can get expensive.

I've been liking the decorator idea more the more I think about it (the idea Joris mentioned of using a decorator to let users tell us what their function takes/returns). I think it would apply to UDFs in general, not just agg (maybe not even just groupby). We would get a lot of mileage just by having it specify the UDF's argument/return types. More if it specifies "all returned Series have the same Index" or "here are the return dtypes". In the latter case we could get a non-trivial perf improvement by avoiding an object-dtype intermediate, though that would take non-trivial effort on our end (now that I think of it, is that just reinventing numba?).
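One way such a decorator could work (a hypothetical sketch; no such API exists in pandas today) is to merely record metadata on the function, which internals like groupby.agg could then consult to choose by-column vs by-frame dispatch and to pre-allocate typed output instead of an object-dtype intermediate:

# Hypothetical sketch; the decorator name and metadata keys are illustrative only
def udf_info(takes="column", returns="scalar", dtypes=None, same_index=False):
    def decorate(func):
        func._udf_info = {
            "takes": takes,            # "column" (Series) or "frame" (DataFrame)
            "returns": returns,        # e.g. "scalar" or "series"
            "dtypes": dtypes,          # known return dtype(s), for preallocation
            "same_index": same_index,  # all returned Series share one Index
        }
        return func
    return decorate

@udf_info(takes="frame", returns="scalar", dtypes="int64")
def total(grp):
    return int(grp.sum().sum())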

Short-term, the only strong opinion I have is that "by" should not be re-used, as it already has a meaning in other groupby methods.

rhshadrach commented 6 months ago

I've been liking the decorator idea more the more I think about it.

My only opposition to the decorator is that it would make inline lambdas more awkward. But it's not strong opposition.

rhshadrach commented 6 months ago

Short-term, the only strong opinion I have is that "by" should not be re-used, as it already has a meaning in other groupby methods.

Short-term, I would like to resolve the bug https://github.com/pandas-dev/pandas/issues/39169 by removing any by-frame operation in DataFrameGroupBy.agg and only operating by-column for now. We can then add by-frame as a clean enhancement in the future. As mentioned in the OP, the only cases that currently reach by-frame behavior are (a) a single grouping with args/kwargs supplied, or (b) an empty input. I think those cases are rare enough, and the behavior bad enough, that the bug should be fixed sooner rather than later.

For implementation, I would deprecate those cases and instruct the user to use apply instead.

Is there any support or opposition to this path forward?