Open rhshadrach opened 6 months ago
Nice write up.
If we were to choose Option A, and if someone had a func that operated on a DataFrame, what would be their migration path? Can you illustrate via an example?
> If we were to choose Option A, and if someone had a func that operated on a DataFrame, what would be their migration path? Can you illustrate via an example?
During the dev meeting, I expressed doubt that this could be reliably answered. Seeing how bad the behavior of `_aggregate_frame` (the by-frame case of `agg`) is for iterables, it now appears to me to be much easier than I was anticipating.

In the OP, I detailed the poor behavior of `_aggregate_frame` when the UDF returns iterables. So I think (but still do not quite feel certain) this means we only need to consider two cases: scalar returns, and Series returns that can be aligned with the columns. I say "can be aligned with the columns" because, as shown in the OP example, if the result is a Series, `_aggregate_frame` aligns the result's index with the DataFrame's columns.

- Scalar return: `_aggregate_frame` gives an undesired result, whereas `apply` gives the expected one.
- Series return: both give the same, expected, result.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})
gb = df.groupby("a")

# Scalar return
def func(x, **kwargs):
    return np.sum(np.sum(x))

print(gb.agg(func, y=2))
#     b   c
# a
# 1  14  14
# 2  10  10

print(gb.apply(func, include_groups=False))
# a
# 1    14
# 2    10
# dtype: int64

# Series return that is alignable with the columns
def func(x, **kwargs):
    return x.sum() + np.sum(np.sum(x))

print(gb.agg(func, y=2))
#     b   c
# a
# 1  21  21
# 2  15  15

print(gb.apply(func, include_groups=False))
#     b   c
# a
# 1  21  21
# 2  15  15
```
One thing I missed: `DataFrameGroupBy.apply` has special handling for Series, which would make it undesirable if users want a Series result treated as if it were a scalar for the purposes of aggregation. I personally do not think this is a significant case.
If the index of the result is always the same across groups, `DataFrameGroupBy.apply` stacks the results horizontally (the first example below). If the index is not always the same, it stacks the results vertically (the second example). When you have an operation that behaves like the latter (index differs per group) but it operates on a single group, you get the behavior of the former case. This is #54992.
```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})
gb = df.groupby("a")

def func(x, **kwargs):
    return pd.Series([1, 2, 3])

print(gb.apply(func, include_groups=False))
#    0  1  2
# a
# 1  1  2  3
# 2  1  2  3

def func(x, **kwargs):
    if len(x) == 2:
        return pd.Series([1, 2, 3], index=list("xyz"))
    else:
        return pd.Series([4, 5, 6], index=list("xwz"))

print(gb.apply(func, include_groups=False))
# a
# 1  x    1
#    y    2
#    z    3
# 2  x    4
#    w    5
#    z    6
# dtype: int64
```
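A minimal sketch of the #54992 edge case described above (my own construction, not from the thread): when only one group exists, an index-varying UDF necessarily produces the same index for every group pandas sees, so `apply` takes the horizontal-stacking path instead of stacking vertically.

```python
import pandas as pd

# Same index-varying UDF as above, but the frame now contains a single group,
# so apply only ever sees one result index and stacks horizontally.
df = pd.DataFrame({"a": [1, 1], "b": [3, 4], "c": [3, 4]})
gb = df.groupby("a")

def func(x, **kwargs):
    if len(x) == 2:
        return pd.Series([1, 2, 3], index=list("xyz"))
    return pd.Series([4, 5, 6], index=list("xwz"))

# Produces a single wide row with columns x, y, z (horizontal stacking),
# not the vertically-stacked Series seen in the multi-group example.
print(gb.apply(func, include_groups=False))
```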
> If the index of the result is not always the same
As mentioned in the call, these checks can get expensive.
I've been liking the decorator idea more the more I think about it. I think it would apply to UDFs in general, not just `agg` (maybe not even just groupby). We would get a lot of mileage just by having it specify its argument/return types. More if it specifies "all returned Series have the same Index" or "here are the return dtypes". In the latter case we could get a non-trivial perf improvement by avoiding an object-dtype intermediate (with non-trivial effort on our end).
I've talked myself into liking the idea Joris mentioned of using a decorator to let users tell us what their function takes/returns. I also think that can improve perf in cases where e.g. dtypes can be known in advance and we can avoid an object-dtype intermediate. (That perf improvement would take some non-trivial effort on our end; though now that I think of it, is that just reinventing numba?)
Short-term, the only strong opinion I have is that "by" should not be re-used, as it already has a meaning in other groupby methods.
> I've been liking the decorator idea more the more I think about it.
My only opposition to the decorator is that it would make inline lambdas more awkward. But it's not strong opposition.
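To make the discussion concrete, here is a purely hypothetical sketch of what such a decorator could look like. No such API exists in pandas; `udf`, `takes`, and `returns` are invented names, and the decorator below only attaches metadata that `agg` could, in principle, inspect to decide whether to pass each group as a Series or a DataFrame.

```python
import pandas as pd

# Hypothetical decorator: records what the UDF accepts and returns.
def udf(takes="series", returns="scalar"):
    def wrap(func):
        func._pandas_udf_meta = {"takes": takes, "returns": returns}
        return func
    return wrap

@udf(takes="frame", returns="scalar")
def total(x, **kwargs):
    # Under takes="frame", x would be each group's DataFrame
    return float(x.to_numpy().sum())

# agg would (hypothetically) read the metadata; today we can only inspect it.
print(total._pandas_udf_meta)
print(total(pd.DataFrame({"b": [1, 2]})))
```

This also shows the lambda awkwardness mentioned above: an inline lambda would need explicit wrapping, e.g. `udf(takes="frame")(lambda x: x.sum())`.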
> Short-term, the only strong opinion I have is that "by" should not be re-used, as it already has a meaning in other groupby methods.
Short-term, I would like to resolve the bug https://github.com/pandas-dev/pandas/issues/39169 by removing any by-frame operation in `DataFrameGroupBy.agg` and only operating by-column for now. We can then add by-frame as a clean enhancement in the future. As mentioned in the OP, the only cases where you currently get by-frame behavior are (a) a single grouping combined with supplying args/kwargs, or (b) an empty input. I think those cases are small enough, and the bug is sufficiently bad, that it should be fixed sooner rather than later.

For implementation, I would deprecate those cases and instruct the user to use `apply` instead.
Is there any support or opposition to this path forward?
This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.
Summary: `DataFrameGroupBy.agg` will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF each group's entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or we should allow users to control whether their UDF is passed a Series or a DataFrame. If we do enable users to control this, we need to decide what the right behavior is in the by-frame case.
Relevant code: https://github.com/pandas-dev/pandas/blob/b4493b6b174434a2c7f99e7bcd58eedf18f037fa/pandas/core/groupby/generic.py#L1546-L1569
Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:

1. there is a single grouping (e.g. `.groupby("a")` and not `.groupby(["a", "b"])`) and the user is passing `args` or `kwargs` through to the UDF; or
2. `.agg([func])` raises a ValueError with "No objects to concatenate" in the message.

In all other cases, pandas operates by-column. The only place (2) is currently hit in the test-suite is aggregating an empty DataFrame. I do not see how else it might be possible to hit (2).
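A minimal demonstration of rule (1), constructed from the examples elsewhere in this thread (the by-frame output is the one printed earlier, for the pandas version under discussion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})

def func(x, **kwargs):
    # By-column, x is a single column's Series; by-frame, x is the whole
    # group's DataFrame, so the nested sum gives different answers.
    return np.sum(np.sum(x))

# Single grouping, nothing passed through: by-column.
print(df.groupby("a").agg(func))
#    b  c
# a
# 1  7  7
# 2  5  5

# Same single grouping, but a kwarg passed through: by-frame
# (output under the pandas version discussed in this thread).
print(df.groupby("a").agg(func, y=2))
#     b   c
# a
# 1  14  14
# 2  10  10
```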
Impact: When a user provides a UDF that can work either by-column or by-frame, but does not necessarily produce the same result in both modes, whether they have a single grouping and/or provide an arg/kwarg to pass through can change the result. This is #39169.

This seems to me like particularly bad behavior in `DataFrameGroupBy.agg`, and I'd like to resolve this bug.
Option A: Only support by-column in `groupby(...).agg`
Pros: Most straightforward. Cons: Removes the ability of users to use `agg` when they want the UDF to use multiple columns (which isn't very accessible anyway).
Option B: Add an argument
Users would communicate what they want the UDF to accept by passing an argument.
Pros: Enables users to have their UDF accept a frame when they desire. Cons: We need to decide on behavior in the case of `by="frame"` (see below).

Option C-a: Required Decorator
Users would communicate what they want the UDF to accept by adding a decorator.
Pros: Enables users to have their UDF accept a frame when they desire. Cons: We need to decide on behavior in the case of `by="frame"` (see below).

Option C-b: Optional Decorator
This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.
If we do decide to support by-frame
Then we need to decide on its behavior. Currently `_aggregate_frame` will create a dictionary mapping each group's name to the UDF's result for that group, pass this dictionary to the DataFrame constructor, and then take a transpose. https://github.com/pandas-dev/pandas/blob/b4493b6b174434a2c7f99e7bcd58eedf18f037fa/pandas/core/groupby/generic.py#L1614-L1621
This behavior works well when scalars are returned by the UDF, but not when iterable objects are returned.
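The code example that originally followed here appears to have been lost in extraction. As a reconstruction (my own, with hypothetical variable names), the dict-then-constructor-then-transpose recipe can be imitated by hand to show what goes wrong with an iterable return whose index does not match the columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [3, 4, 5]})

def func(x, **kwargs):
    # An iterable return whose index does not match the frame's columns
    return pd.Series([1.0, 2.0, 3.0], index=list("xyz"))

# Imitate _aggregate_frame's recipe: {group name: UDF result}, handed to the
# DataFrame constructor with the non-grouped columns as the index, transposed.
result = {name: func(grp) for name, grp in df.groupby("a")}
out = pd.DataFrame(result, index=df.columns.drop("a")).T
print(out)
# The xyz-indexed Series is aligned against columns ["b", "c"], so every
# value is silently dropped and replaced with NaN.
```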
Option I: Treat all returns as if they are scalars.
Pros: Simple for users to understand and for us to implement. Cons: Does not allow a performance enhancement by operating on a DataFrame and returning a Series (e.g. `lambda x: x.sum()` where `x` is a DataFrame).

Option II: Add an argument.
Users would communicate how they want the UDF's return interpreted via an argument.
Pros: Explicit user control. Cons: Requires an additional argument.

Option IIIa: A Series is interpreted as "stack vertically"; all other returns are treated as scalars.
Pros: No additional argument / features needed. Cons: Somewhat magical; users cannot have pandas treat a Series as if it were a scalar.

Option IIIb: A Series is interpreted as "stack vertically" unless boxed with `pd.Scalar`; all other returns are treated as scalars.
Boxing the UDF's result with `pd.Scalar` tells pandas to treat it like a scalar. Note that `pd.Scalar` does not currently exist.
Pros: No additional argument needed. Cons: Somewhat magical; need to add `pd.Scalar`.
cc @jorisvandenbossche @jbrockmendel @Dr-Irv
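As a footnote to Option IIIb, a purely hypothetical sketch of the proposed box: `pd.Scalar` does not exist, so `Scalar` below is a locally defined stand-in that only marks a value so `agg` could, in principle, treat an iterable return as one opaque scalar.

```python
import pandas as pd

# Hypothetical stand-in for the proposed pd.Scalar wrapper.
class Scalar:
    def __init__(self, value):
        self.value = value

def func(x, **kwargs):
    # Ask, hypothetically, for this Series to be treated as a single scalar
    # rather than being stacked vertically.
    return Scalar(pd.Series([1, 2, 3]))

boxed = func(None)
print(type(boxed.value))
```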