pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Allow custom aggregation functions with multiple return values. #59781

Closed noppelmax closed 2 months ago

noppelmax commented 2 months ago

Problem Description

I want to aggregate multiple columns with bootstrapping (scipy.stats.bootstrap). This aggregation produces multiple scalar outputs per column, e.g. the lower and upper bounds of the confidence interval and the mean (at least 3 values). I want to apply it to each of several columns independently, so apply is not really suitable, but agg can only handle aggregation functions with a single scalar output. Bootstrapping is a 101 task in data analysis, and I don't understand why it is so complicated to implement in pandas. I know that in theory I could write separate aggregation functions for the lower bound, the upper bound, and the mean, but that would 1. take three times as long and 2. be mathematically questionable, because the bounds and the mean might then come from different random draws. Fixing the random seed would address the second point, but three times the runtime is still not good...
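For reference, the three values described above can be produced with a plain-NumPy resampling loop; this is a hedged sketch standing in for scipy.stats.bootstrap (the helper name bootstrap_ci and its defaults are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(sample, n_resamples=1000, level=0.95):
    # Resample with replacement and collect the mean of each resample.
    sample = np.asarray(sample)
    means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    # Percentile confidence interval plus the point estimate: three scalars.
    alpha = (1 - level) / 2
    lower, upper = np.quantile(means, [alpha, 1 - alpha])
    return lower, upper, sample.mean()
```

This is exactly the shape of aggregation the issue is about: one input column in, three labeled scalars out.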

Feature Description

Here is some demo code

import pandas as pd
import numpy as np

def custom_aggregate(data):
    # return 1  # Works
    return pd.Series({
        'mean_ci_lower': 0.3,  # hardcoded for demo
        'mean_ci_upper': 0.5,  # hardcoded for demo
        'real_mean': np.mean(data),
    })

def main():
    data = {
        'acctrain': [0.496070, 0.579231, 0.1, 0.3],
        'acctest':  [0.455256, 0.147513, 0.1, 0.5],
        'experimentname': ['experimentA', 'experimentB', 'experimentA', 'experimentB'],
    }

    df = pd.DataFrame(data)
    print(df)
    print("Aggregated: ")
    df2 = df.groupby(["experimentname"]).agg({
        "acctrain": custom_aggregate,
        "acctest": custom_aggregate,
    }).reset_index()
    print(df2)

if __name__ == '__main__':
    main()

I would expect to get a multi-indexed data frame like

                acctrain                                    acctest
                mean_ci_lower  mean_ci_upper  real_mean     mean_ci_lower  mean_ci_upper  real_mean
experimentname
experimentA     0.3            0.5            0.298035      0.3            0.5            0.277628
experimentB     0.3            0.5            0.439616      0.3            0.5            0.323757

Alternative Solutions

The hacky workaround I use right now is something along the lines of

def custom_aggregate(data):
    return (1,2)

and later

df2[['acctrain_ci_lower', 'acctrain_ci_upper']] = pd.DataFrame(df2['acctrain'].tolist(), index=df2.index)
df2[['acctest_ci_lower', 'acctest_ci_upper']] = pd.DataFrame(df2['acctest'].tolist(), index=df2.index)

So I return a tuple and then extract the tuple into multiple columns afterwards. This requires a lot of hardcoding...
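A less hardcoded variant of the same workaround can be sketched with per-column apply plus unstack/concat; the helper ci_stats below is hypothetical (min/max/mean stand in for real bootstrap bounds), but the reshaping pattern matches the multi-indexed result described above:

```python
import numpy as np
import pandas as pd

def ci_stats(s):
    # Hypothetical stand-in for a bootstrap: three labeled scalars per group.
    return pd.Series({"ci_lower": s.min(), "ci_upper": s.max(), "mean": s.mean()})

df = pd.DataFrame({
    "acctrain": [0.1, 0.2, 0.3, 0.4],
    "acctest":  [0.5, 0.6, 0.7, 0.8],
    "experimentname": ["A", "B", "A", "B"],
})

gb = df.groupby("experimentname")
# Series.apply returning a Series yields a (group, stat) MultiIndex;
# unstack turns the stats into columns, and concat with a dict of frames
# builds the (column, stat) MultiIndex on the columns.
result = pd.concat(
    {col: gb[col].apply(ci_stats).unstack() for col in ["acctrain", "acctest"]},
    axis=1,
)
```

This avoids renaming columns by hand, at the cost of one apply pass per value column.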

Additional Context

No response

rhshadrach commented 2 months ago

Thanks for the request. Agreed that these are valid use cases that can come up, but I don't think the need is prominent enough to justify expanding the API. agg / apply already have various bad behaviors, and I do not think adding more complexity is a good idea, at least at this time.

In cases like these, you can always iterate over the groupby object directly. The downside is this may take longer depending on the ratio of number of groups to number of rows (larger ratios will be less performant).

import numpy as np
import pandas as pd

size = 100000

df = pd.DataFrame(
    {
        "key": np.random.randint(0, 5, size),
        "a": np.random.random(size),
    }
)
gb = df.groupby("key")
# populate caches
_ = gb.sum()

def foo(gb):
    return gb.agg(lambda x: x.sum())

def bar(gb):
    buf = {}
    for group, data in gb:
        buf[group] = data["a"].sum()
    result = pd.Series(buf)
    result.index.name = "key"
    return result

%timeit foo(gb)
823 μs ± 3.55 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit bar(gb)
997 μs ± 3.23 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
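The per-group loop above can also return several labeled values at once, which covers the original multi-output use case; a minimal sketch (hypothetical min/max/mean stats in place of a real bootstrap):

```python
import pandas as pd

df = pd.DataFrame({"key": [0, 0, 1, 1], "a": [1.0, 3.0, 2.0, 6.0]})

buf = {}
for group, data in df.groupby("key"):
    s = data["a"]
    # Several outputs per group, collected as a dict of labeled scalars.
    buf[group] = {"lo": s.min(), "hi": s.max(), "mean": s.mean()}

# One row per group, one column per statistic.
result = pd.DataFrame.from_dict(buf, orient="index")
result.index.name = "key"
```

The resulting frame has the groups as its index and the statistics as columns, with no post-hoc tuple extraction needed.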

mroeschke commented 2 months ago

Agreed that it's probably not worth expanding the API for this use case. Thanks for the suggestion, but closing.