Named aggregations with multiple columns

erfannariman commented 4 years ago

Since pandas 0.25.0 we have named aggregations, which work fine for aggregations on single columns. But what if you want to apply an aggregation over multiple columns?

Example:

# example dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

          a         b         c         d  group
0  0.751462  0.572576  0.192957  0.921723      0
1  0.070777  0.801548  0.601678  0.344633      0
2  0.112964  0.361984  0.416241  0.785764      1
3  0.380045  0.486494  0.000594  0.608759      1

# aggregations on single columns
df.groupby('group').agg(
             a_sum=('a', 'sum'),
             a_mean=('a', 'mean'),
             b_mean=('b', 'mean'),
             c_sum=('c', 'sum'),
             d_range=('d', lambda x: x.max() - x.min())
)

          a_sum    a_mean    b_mean     c_sum   d_range
group                                                  
0      0.947337  0.473668  0.871939  0.838150  0.320543
1      0.604149  0.302074  0.656902  0.542985  0.057681

But what if we want to calculate a.max() - b.max() while aggregating? That does not seem to be possible. For example, something like this would make sense:

df.groupby('group').agg(
    diff_a_b=(['a', 'b'], lambda x: x['a'].max() - x['b'].max())
)

So is it possible to do named aggregations on multiple columns? If not, is this in the pipeline for future releases?
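Until something like the syntax above is supported, one workaround is to use `groupby(...).apply` with a function that receives each group's sub-frame and returns a `pd.Series` of named results. This is only a sketch: the helper name `summarize`, the output names, and the fixed seed are assumptions made for illustration, and `apply` is generally slower than `agg` with built-in reducers.

```python
import numpy as np
import pandas as pd

# Same shape as the example frame above; the seed is only for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 4)), columns=list("abcd"))
df["group"] = [0, 0, 1, 1]


def summarize(g: pd.DataFrame) -> pd.Series:
    # `g` is the sub-frame for one group, so several columns are visible at once.
    return pd.Series(
        {
            "a_sum": g["a"].sum(),
            "diff_a_b": g["a"].max() - g["b"].max(),
        }
    )


# Select the needed columns explicitly so the grouping column is not passed to `summarize`.
result = df.groupby("group")[["a", "b"]].apply(summarize)
print(result)
```

The result has one row per group and one column per key of the returned Series, which gives roughly the shape that a multi-column named aggregation would produce.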

delica1 commented 4 years ago

Yes please. I would also be interested in this feature. I posted a feature request a few months ago (#28190), so it is good to see I am not alone.

If I am not mistaken, it seems it may be easier to implement now with the named aggregates functionality too.

SpectrumWings commented 4 years ago

take

theSuiGenerisAakash commented 4 years ago

Is it out now?

erfannariman commented 4 years ago

@SpectrumWings are you still working on this? Else I would like to give it a go.

erfannariman commented 3 years ago

take

JasonAHendry commented 2 years ago

Hi all. Just wanted to say I would love to see this feature developed. It's a routine very commonly needed in scientific data analysis. dplyr &c support it; would be fantastic to see in pandas.

SanderLam commented 2 years ago

Hi there, is there any update on when we can expect this feature?

jreback commented 2 years ago

@SanderLam pandas is all volunteer

features happen when the community does pull requests - you are welcome to do that

core can provide review

Mondonauta commented 1 year ago

I'm very interested in this feature as well

nick-konovalchuk commented 1 year ago

Looking forward to this one

alink-volpe commented 10 months ago

Very interested in this, too. I keep getting bummed out that pandas isn't quite as elegant as R when it comes to groupby > aggregate logic, but this would be a great addition!

tawfikharoun commented 8 months ago

Interestingly, Polars does this natively! So if you really need it, you can convert the DataFrame to Polars and do it there. I genuinely believe that pandas should adopt this as well.

samukweku commented 2 months ago

take

tawfikharoun commented 2 months ago

@samukweku it would be something like:

import polars as pl

df.group_by("col0").agg(
    sum_all_under_200=pl.col('col1').filter(pl.col('col2') > 200).sum()
)
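For comparison, a rough pandas counterpart of that Polars expression can be written today by masking the column before grouping and then using an ordinary single-column named aggregation. This is a sketch on made-up data: the values, the helper column `col1_kept`, and the output name are assumptions, and only the column names `col0`, `col1`, and `col2` come from the snippet above.

```python
import pandas as pd

# Hypothetical data; the column names follow the Polars snippet above.
df = pd.DataFrame(
    {
        "col0": ["x", "x", "y", "y"],
        "col1": [1, 2, 3, 4],
        "col2": [100, 300, 250, 50],
    }
)

# Keep col1 only where col2 > 200 (NaN elsewhere), then aggregate the masked column.
result = (
    df.assign(col1_kept=df["col1"].where(df["col2"] > 200))
    .groupby("col0")
    .agg(col1_sum_where_col2_over_200=("col1_kept", "sum"))
)
print(result)
```

Because `sum` skips NaN by default, rows that fail the condition simply drop out of each group's total. The cost is an extra temporary column, which is the step the Polars expression avoids.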

pandas-dev / pandas

Named aggregations with multiple columns #29268