pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add Support for GroupBy Numeric Operations #20060

Open WillAyd opened 6 years ago

WillAyd commented 6 years ago

xref some of the conversation in #20024. Right now the following is possible:

In [16]: df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])
In [18]: df - df.mean()
Out[18]: 
   key  val
0 -0.5 -1.5
1 -0.5 -0.5
2  0.5  0.5
3  0.5  1.5

In [19]: df['val'] - df['val'].mean()
Out[19]: 
0   -1.5
1   -0.5
2    0.5
3    1.5
Name: val, dtype: float64

But trying to do something similar with grouped data does not work:

In [20]: df.groupby('key') - df.groupby('key').mean()
        ...
ValueError: Unable to coerce to Series, length must be 1: given 2

I am proposing that we update the GroupBy class to allow numerical operations with the result of aggregations or transformations against that object. Note that this is possible today, though only through a much more verbose and hackish approach:

In [23]: df.groupby('key').shift(0) - df.groupby('key').transform('mean')
Out[23]: 
   val
0 -0.5
1  0.5
2 -0.5
3  0.5

The Series / DataFrame arithmetic operations are all added via add_special_arithmetic_methods, with their implementations defined in ops.py. We could leverage a similar mechanism for GroupBy.
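As a rough illustration of what such an op might do under the hood, here is a minimal sketch. The function name `groupby_sub` is hypothetical (not a pandas API); it relies on the semi-public `.obj` attribute and assumes the right operand is an aggregation over the same GroupBy with the default `sort=True`, so `ngroup()` labels line up with the aggregated index:

```python
import pandas as pd

def groupby_sub(gb, other):
    """Hypothetical sketch of a GroupBy.__sub__ implementation.

    'other' is assumed to be an aggregation over the same GroupBy,
    i.e. indexed by the (sorted) group keys. Broadcast it back to one
    row per original row, then subtract.
    """
    labels = gb.ngroup()                          # integer group id per row
    broadcast = other.iloc[labels.to_numpy()]     # repeat each group's row
    broadcast = broadcast.set_axis(labels.index)  # realign to original index
    # .obj is the original object the GroupBy was constructed from
    return gb.obj[other.columns] - broadcast

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])
gb = df.groupby('key')
print(groupby_sub(gb, gb.mean()))
#    val
# 0 -0.5
# 1  0.5
# 2 -0.5
# 3  0.5
```

This reproduces the shift(0)/transform workaround above without materializing a transform for every op.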

Why is this worth doing?

  1. Consistent arithmetic ops for Series, DataFrame and GroupBy objects
  2. May enable deprecation of methods like mad (see #20024)
  3. Provides easier "demeaning" and "normalization" for grouped data
  4. Mirrors xarray implementation which appears well received by user base

Why may it not be worth doing?

  1. Will add more complexity to a GroupBy class that is already in need of refactor
  2. TBD

Consideration Points

With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy. The result of the operation should be a Series or DataFrame indexed like the original object.

That said, the following operations would in theory be identical:

df.groupby('key') - df.groupby('key').mean()
# OR
df.groupby('key') - df.groupby('key').transform('mean')

I'm not sure if we care to differentiate between these and force users into choosing one or the other.

Thoughts?

WillAyd commented 6 years ago

@shoyer

shoyer commented 6 years ago

With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy.

It would be surprising not to also support the other order, e.g., other + groupby. This should be totally doable, though.

jreback commented 6 years ago

This is already the idiomatic way, and it is quite explicit:

In [3]: df.val - df.groupby('key').val.transform('mean')
Out[3]: 
0   -0.5
1    0.5
2   -0.5
3    0.5
Name: val, dtype: float64
WillAyd commented 6 years ago

Fair enough on Series, though when working with frames you'd have to explicitly drop the grouped columns:

In [14]: df - df.groupby('key').transform('mean')
Out[14]: 
   key  val
0  NaN -0.5
1  NaN  0.5
2  NaN -0.5
3  NaN  0.5

So I guess the question becomes: do we think abstracting all of this through support for GroupBy operations is worth it, or would we rather live with some slight inconsistencies in how we calculate across the various objects?

shoyer commented 6 years ago

Another use case is when the normalization values are pre-computed, perhaps from another dataset, e.g., you only have access to df.groupby('key').mean(). You can certainly still apply these in pandas, but it requires explicit reindexing or merging.
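A sketch of the explicit alignment this use case requires today (the `precomputed` frame is invented here to stand in for means arriving from another dataset):

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])

# Pretend these per-group means were computed elsewhere, keyed by group:
precomputed = pd.DataFrame({'val': [1.5, 3.5]},
                           index=pd.Index([0, 1], name='key'))

# Explicit alignment: map each row's key to its precomputed group value,
# then do ordinary Series arithmetic.
demeaned = df['val'] - df['key'].map(precomputed['val'])
print(demeaned.tolist())  # [-0.5, 0.5, -0.5, 0.5]
```

A merge on 'key' would work equally well; either way the alignment step is manual, which is what GroupBy arithmetic would remove.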

rhshadrach commented 6 months ago

@WillAyd - Just taking a look through old issues to see if we have any that can be closed. Have your thoughts here evolved at all? For the DataFrame case you can do:

print(df - df.groupby('key')[df.columns].transform('mean'))
#    key  val
# 0  0.0 -0.5
# 1  0.0  0.5
# 2  0.0 -0.5
# 3  0.0  0.5

Though I'm not sure that would be the intended result; should key be left untouched rather than normalized? I imagine that would also be the result of df.groupby('key') - df.groupby('key').transform('mean'), but I'm not sure.
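For reference, the "key left untouched" behavior is achievable today by operating only on the non-grouping columns, a sketch using the example frame from this thread:

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (0, 2), (1, 3), (1, 4)], columns=['key', 'val'])

# Demean only the value columns; leave the grouping column as-is.
value_cols = [c for c in df.columns if c != 'key']
out = df.copy()
out[value_cols] = df[value_cols] - df.groupby('key')[value_cols].transform('mean')
print(out)
#    key  val
# 0    0 -0.5
# 1    0  0.5
# 2    1 -0.5
# 3    1  0.5
```

This is arguably the result a GroupBy arithmetic op should produce, since the grouping column defines the groups rather than participating in them.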