Closed noppelmax closed 2 months ago
Thanks for the request. Agreed these are valid use cases that can come up, but I don't think it's prominent enough to expand the API. agg / apply already have various bad behaviors, and I do not think adding more complexity is a good idea, at least at this time.
In cases like these, you can always iterate over the groupby object directly. The downside is this may take longer depending on the ratio of number of groups
to number of rows
(larger ratios will be less performant).
size = 100000
df = pd.DataFrame(
{
"key": np.random.randint(0, 5, size),
"a": np.random.random(size),
}
)
gb = df.groupby("key")
# populate caches
_ = gb.sum()
def foo(gb):
return gb.agg(lambda x: x.sum())
def bar(gb):
buf = {}
for group, data in gb:
buf[group] = data["a"].sum()
result = pd.Series(buf)
result.index.name = "key"
return result
%timeit foo(gb)
823 μs ± 3.55 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit bar(gb)
997 μs ± 3.23 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Agreed that it's probably not worth expanding the API for this use case. Thanks for the suggestion but closing
Feature Type
[ ] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
I want to aggregate multiple columns with bootstrapping (scipy.stats.bootstrap). However, this aggregation produces multiple scalar outputs, e.g., the lower and upper bound of the confidence interval, and the mean (= at least 3 columns). I want to apply this aggregation to multiple columns independently (hence 'apply is not really suitable). However,
agg
can only handle aggregation function with one scalar output. Bootstrapping is a 101 task in data analysis and I don't really understand why this is so complicated to implement in pandas. I know that theoretically I could write one aggregate function for the lower, upper bound and the mean separately. but that would 1. take three times longer and 2. would mathematically be questionable as the bound and the mean might come from different random processes. Fixing the randomness would be one way, but still three times longer is not good...Feature Description
Here is some demo code
I would expect to get multi-indexed data frame like
Alternative Solutions
The hacky workaround I use right now is something along the lines of
and later
So I return a tuple and then extract the tuple later into multiple columns. This requires a lot hardcoding...
Additional Context
No response