Open WillAyd opened 6 years ago
@shoyer
With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy.
It would be surprising not to also support the other order, e.g., other + groupby. This should be totally doable, though.
This already is the idiomatic way, quite explicit.
In [3]: df.val - df.groupby('key').val.transform('mean')
Out[3]:
0 -0.5
1 0.5
2 -0.5
3 0.5
Name: val, dtype: float64
Fair enough on Series, though when working with frames you'd have to explicitly drop the grouped columns:
In [14]: df - df.groupby('key').transform('mean')
Out[14]:
key val
0 NaN -0.5
1 NaN 0.5
2 NaN -0.5
3 NaN 0.5
So I guess the question becomes: do we think abstracting all of this through support for GroupBy operations is worth it, or would we rather live with some slight inconsistencies in how to calculate across the various objects?
Another use case is when the normalization values are pre-computed, perhaps from another dataset, e.g., you only have access to df.groupby('key').mean(). You can certainly still apply these in pandas, but it requires explicit reindexing or merging.
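As a rough sketch of what that explicit reindexing or merging can look like today: the frame, the column names and the pre-computed means below are made up for illustration, and `means` simply stands in for something like the output of `df.groupby('key').mean()` computed elsewhere.

```python
import pandas as pd

# Illustrative data; not taken from the issue.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 2.0, 3.0, 4.0]})

# Stand-in for group means that were computed elsewhere (assumption).
means = pd.Series({"a": 1.5, "b": 3.5}, name="val")
means.index.name = "key"

# Option 1: look up each row's key in the pre-computed means, then subtract.
centered_map = df["val"] - df["key"].map(means)

# Option 2: merge the means back onto the frame by key, then subtract.
merged = df.merge(
    means.rename("val_mean"), left_on="key", right_index=True, how="left"
)
centered_merge = merged["val"] - merged["val_mean"]
```

Both options give the same demeaned values; the point is only that the alignment step has to be spelled out by hand.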
@WillAyd - Just taking a look through old issues to see if we have any that can be closed. Have your thoughts here evolved at all? For the DataFrame case you can do:
print(df - df.groupby('key')[df.columns].transform('mean'))
# key val
# 0 0.0 -0.5
# 1 0.0 0.5
# 2 0.0 -0.5
# 3 0.0 0.5
Though I'm not sure that would be the intended result; should key be left untouched rather than normalized? I imagine that would be the result of df.groupby('key') - df.groupby('key').transform('mean') too, but not sure.
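If leaving key untouched is the desired behavior, one way to get it with current pandas might look like the sketch below. The sample frame and the `value_cols` helper name are assumptions for illustration only.

```python
import pandas as pd

# Illustrative data; not taken from the issue.
df = pd.DataFrame({"key": [0, 0, 1, 1], "val": [1.0, 2.0, 3.0, 4.0]})

# Every column except the grouping key.
value_cols = [col for col in df.columns if col != "key"]

# transform('mean') returns a frame like-indexed to df for the selected columns.
group_means = df.groupby("key")[value_cols].transform("mean")

# Demean only the value columns; the key column is carried over unchanged.
result = df.copy()
result[value_cols] = df[value_cols] - group_means
```

Here `result` keeps the original key values and only `val` is normalized, which is the opposite choice from the snippet above where key ends up as 0.0.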
xref some of the conversation in #20024. Right now the following is possible:
But trying to do something similar with grouped data does not work:
I am proposing that we update the GroupBy class to allow numerical operations with the result of aggregations or transformations against that object. Note that this is possible today through a much more verbose and hackish:
The Series/DataFrame operations are all added via add_special_arithmetic_methods with their implementations being defined in ops.py. We could leverage a similar mechanism for GroupBy.
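To make the idea a bit more concrete, here is a minimal sketch of attaching arithmetic methods to a group-like object in a loop. `ArithmeticGroupBy` and `_make_method` are hypothetical names invented for this example; this is a thin wrapper around a Series and its groupby, not how pandas implements anything, and it assumes the right operand is already like-indexed to the original object.

```python
import operator

import pandas as pd


class ArithmeticGroupBy:
    """Hypothetical wrapper for illustration; not part of pandas."""

    def __init__(self, obj, by):
        self._obj = obj                  # the original Series or DataFrame
        self._grouped = obj.groupby(by)  # the real pandas GroupBy

    def transform(self, func):
        # Delegate to pandas; the result is like-indexed to the original object.
        return self._grouped.transform(func)


def _make_method(op):
    def method(self, other):
        # Apply the operator between the original object and the operand.
        return op(self._obj, other)
    return method


# Attach the dunder methods in one place, loosely mirroring the idea of
# adding arithmetic methods to a class programmatically.
for name, op in {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "truediv": operator.truediv,
}.items():
    setattr(ArithmeticGroupBy, f"__{name}__", _make_method(op))


df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 2.0, 3.0, 4.0]})
grouped = ArithmeticGroupBy(df["val"], df["key"])
out = grouped - grouped.transform("mean")
```

Reflected operators (other + groupby) are left out of this sketch, as is any check that the right operand really came from the same grouping.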
Why is this worth doing?
Series, DataFrame and GroupBy objects
mad (see #20024)
Why may it not be worth doing?
GroupBy class that is already in need of refactor
Consideration Points
With this proposal, the left operand would always be a GroupBy object and the right operand would always be the result of a function application against that same GroupBy. The result of the operation should be a Series or DataFrame like-indexed to the original object.
That said, the following operations would in theory be identical:
I'm not sure if we care to differentiate between these and force users into choosing one or the other.
Thoughts?