@zertrin: Thanks for putting this together. I saw that there was a lot of discussion back in #15931 about this. As I haven't been able to read it in full, I cannot comment at the moment. Nevertheless, let me ping:
@jreback @jorisvandenbossche @TomAugspurger @chris-b1
I agree that renaming with the current `agg` implementation is very clunky and broken in this example. The nested dicts are somewhat complex, but writing them as you did makes it very clear what is happening.
I suppose there could be a `names` parameter added to `agg` which would take a dictionary mapping the aggregating columns to their new names. You could even add another parameter, `drop_index`, as a boolean to determine whether to keep the upper index level.
So the syntax would turn into:
agg_dict = {'energy': ['sum',
                       lambda x: np.percentile(x, 98),  # lambda
                       lambda x: np.percentile(x, 17),  # lambda
                       ],
            'distance': ['sum',
                         'mean',
                         smrb.mad,  # original function
                         mad_c1,    # partial function wrapping the original function
                         ]
            }
name_dict = {'energy': ['energy_sum', 'energy_p98', 'energy_p17'],
             'distance': ['distance_sum', 'distance_mean', 'distance_mad', 'distance_mad_c1']}

mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)
Or maybe an entire new method, `agg_assign`, could be created, which would work similarly to `DataFrame.assign`:
mydf.groupby('cat').agg_assign(
    energy_sum=lambda x: x.energy.sum(),
    energy_p98=lambda x: np.percentile(x.energy, 98),
    energy_p17=lambda x: np.percentile(x.energy, 17),
    distance_sum=lambda x: x.distance.sum(),
    distance_mean=lambda x: x.distance.mean(),
    distance_mad=lambda x: smrb.mad(x.distance),
    distance_mad_c1=lambda x: mad_c1(x.distance))
I actually like this option much better.
For what it's worth, I am also strongly in favour of not deprecating the functionality.
A big reason for me is that there is something deeply queer about mixing Python's function namespace (an artifact of the particular implementation) with the data's column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named `'<lambda>'` causes me severe cognitive dissonance.
The renaming approach grates, because there is this intermediary step where unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to rename reliably and systematically, because there are potentially dependencies on the implementation.
Aside from that, the nested-dict functionality is admittedly complex, but it is a complex operation that is being performed.
TL;DR Please don't deprecate. :)
My contribution is motivated by two things.
Also, the Pandas Series and DataFrame objects have `pipe` methods to facilitate pipelining. In this doc segment it is discussed that we could use `pipe` to proxy for methods in lieu of subclassing. In the same spirit, we could use the new `GroupBy.pipe` to perform a similar role and allow us to build proxy methods for groupby objects.
I'll use @zertrin's example:
import numpy as np
import pandas as pd
import statsmodels.robust as smrb
from functools import partial

# The DataFrame offered up above
mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

# Assumed definition (not restated in this comment): a partial wrapping smrb.mad
mad_c1 = partial(smrb.mad, c=1)
# Identical dictionary passed to `agg`
funcs = {
'energy': {
'total_energy': 'sum',
'energy_p98': lambda x: np.percentile(x, 98), # lambda
'energy_p17': lambda x: np.percentile(x, 17), # lambda
},
'distance': {
'total_distance': 'sum',
'average_distance': 'mean',
'distance_mad': smrb.mad, # original function
'distance_mad_c1': mad_c1, # partial function wrapping the original function
},
}
# Write a proxy method to be passed to `pipe`
def agg_assign(gb, fdict):
    # For each (column, new_name, function) triple in the nested dict,
    # aggregate that column's groups and key the result by (column, new_name);
    # the DataFrame constructor then assembles the MultiIndex columns.
    data = {
        (cl, nm): gb[cl].agg(fn)
        for cl, d in fdict.items()
        for nm, fn in d.items()
    }
    return pd.DataFrame(data)
# All the API we need already exists with `pipe`
mydf.groupby('cat').pipe(agg_assign, fdict=funcs)
Which results in
distance energy
average_distance distance_mad distance_mad_c1 total_distance energy_p17 energy_p98 total_energy
cat
A 1.480 0.355825 0.240 4.44 1.8510 2.0364 5.79
B 0.915 0.140847 0.095 1.83 1.3095 1.5930 2.85
C 0.600 0.000000 0.000 0.60 1.0100 1.0100 1.01
The `pipe` method makes adding new API unnecessary in many cases. It also provides the means for a replacement for the deprecated functionality we are discussing. Therefore, I'd be inclined to go forward with the deprecation.
I really like tdpetrou's idea to use `names=name_dict`. This can make everyone happy: it gives us the possibility to rename columns easily, as we wish.
Not really: as mentioned in my initial post, this would not solve the issue of decoupling the place where the aggregate operation is defined from the name of the resulting column, requiring extra effort to make sure both stay "synchronized".
I'm not saying that's a bad solution (after all, it solves the other issues), but it wouldn't be as easy and clear as the dict-of-dict approach. I mean that at writing time you need to keep both dicts of lists synchronized, and when reading the source, the reader must make an effort to match the names in the second dict of lists with the aggregate definitions in the first. That's twice the effort in each case.
The nested dicts are somewhat complex but writing them as you did makes it very clear what is happening.
I still don't understand why everyone seems to say that dict of dict is complex. To me that's the clearest way of doing it.
That said, if the `names` keyword is the only solution the pandas team is comfortable with, that would still be an improvement over the current situation.
@pirsquared Interesting solution with the current API, albeit in my opinion not quite easy to grasp (I don't really understand how it works :confused: ).
I started a thread on the datascience subreddit - What do you hate about pandas?. Someone brought up their contempt for the MultiIndex returned after a `groupby` and pointed to the dplyr `do` verb, which is implemented in plydata. It happens to work exactly like `agg_assign`, so that was quite interesting.
@zertrin `agg_assign` would be superior to your dict-of-dict approach and identical to SQL aggregations, as well as allowing multiple columns to interact with one another within the aggregation. It would also work identically to `DataFrame.assign`.
Any thoughts @jreback @TomAugspurger ?
... mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)
Although this solves the problem, one needs to align keys and values in two places. I think an API (as suggested for `.agg_assign`) that does not require such book-keeping code is less error prone.
There is also the issue of the clean-up code after using the API. When `groupby` operations return a MultiIndex dataframe, in most cases the user undoes the MultiIndex. The straightforward, declarative way of using `.agg_assign` suggests no hierarchy, no MultiIndex output, and no clean-up afterwards.
Based on usage patterns, I think multi-index outputs should be strictly opt-in and not opt-out.
I was initially skeptical about the `agg_assign` proposition, but the last two comments have convinced me that this could be a good solution.
Especially thinking about the possibility to use it in the form `agg_assign(**relabeling_dict)`, and thus be able to define my `relabeling_dict` like this:
relabeling_dict = {
'energy_sum': lambda x: x.energy.sum(),
'energy_p98': lambda x: np.percentile(x.energy, 98),
'energy_p17': lambda x: np.percentile(x.energy, 17),
'distance_sum': lambda x: x.distance.sum(),
'distance_mean': lambda x: x.distance.mean(),
'distance_mad': lambda x: smrb.mad(x.distance),
'distance_mad_c1': lambda x: mad_c1(x.distance)
}
That would be quite flexible and solve all the issues mentioned in my OP.
@zertrin @has2k1 I was thinking about this a bit more, and this functionality already exists with `apply`. You simply return a Series with the index as the new column names and the values as the aggregations. This allows for spaces in the names and gives you the ability to order the columns exactly how you wish:
def my_agg(x):
    data = {'energy_sum': x.energy.sum(),
            'energy_p98': np.percentile(x.energy, 98),
            'energy_p17': np.percentile(x.energy, 17),
            'distance sum': x.distance.sum(),
            'distance mean': x.distance.mean(),
            'distance MAD': smrb.mad(x.distance),
            'distance MAD C1': mad_c1(x.distance)}
    # Spell out the desired output column order explicitly
    list_of_column_order = ['energy_sum', 'energy_p98', 'energy_p17',
                            'distance sum', 'distance mean',
                            'distance MAD', 'distance MAD C1']
    return pd.Series(data, index=list_of_column_order)

mydf.groupby('cat').apply(my_agg)
So, there might not be a need for a new method and instead just a better example in the docs.
@tdpetrou, you are correct. I had forgotten how `apply` works, as I use my own version because of the double execution in the fast/slow path selection process.
Hum, indeed there is no chance I would have thought about using it in an aggregation context just by reading the doc, however...
Moreover, I still find the solution with `apply` a bit too convoluted. The `agg_assign` approach seemed more straightforward and understandable.
Since there was never really a statement about it: is the dict-of-dict approach (which, albeit currently deprecated, is already implemented and also solves all these issues) really definitely out of the question?
Apart from the `agg_assign` approach, dict-of-dict still seems the simplest one, and it doesn't need any coding, just un-deprecating.
The benefit and drawback of the `agg_assign` approach is that it pushes the column selection into the aggregation method. In all the examples, the `x` passed to the `lambda` is something like `self.get_group(group)` for each group in `self`, a `DataFrameGroupBy` object. This is nice because it cleanly separates the naming, which is in `**kwargs`, from the selection, which is in the function.
The drawback is that your nice, generic aggregation functions now have to be concerned with column selection. There's no free lunch! That means you'll end up with many helpers like `lambda x: x[col].min()`. You'll also need to be careful with things like `np.min`, which reduces over all dimensions, vs. `pd.DataFrame.min`, which reduces over `axis=0`. That's why something like `agg_assign` wouldn't be equivalent to `apply`: `apply` still operates column-wise for certain methods.
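For what it's worth, here is a minimal sketch of that reduction difference (the tiny `df` below is hypothetical, not from the thread):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.min()           # DataFrame.min reduces over axis=0 -> Series with a=1, b=3
np.min(df.values)  # np.min on the raw array reduces over all dimensions -> 1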
I'm not sure about these tradeoffs vs. the dict-of-dicts method, but I'm curious to hear other people's thoughts. Here's a rough sketch of `agg_assign`, which I've called `agg_table` to emphasize that the functions are being passed the tables, not the columns:
from collections import defaultdict
import pandas as pd
import numpy as np
from pandas.core.groupby import DataFrameGroupBy
mydf = pd.DataFrame(
{
'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
},
index=range(6)
)
def agg_table(self, **kwargs):
output = defaultdict(dict)
for group in self.groups:
for k, v in kwargs.items():
output[k][group] = v(self.get_group(group))
return pd.concat([pd.Series(output[k]) for k in output],
keys=list(output),
axis=1)
DataFrameGroupBy.agg_table = agg_table
Usage
>>> gr = mydf.groupby("cat")
>>> gr.agg_table(n=len,
foo=lambda x: x.energy.min(),
bar=lambda y: y.distance.min())
n foo bar
A 3 1.80 1.20
B 2 1.25 0.82
C 1 1.01 0.60
I suspect we could do a bit to make the performance of this less awful, but not nearly as much as `.agg` does...
Could someone from the pandas core team please explain the main reason for deprecating relabeling dicts in `groupby.agg`?
I could easily understand it if maintaining the code causes too many problems, but if it's about complexity for the end user, I would also opt for bringing it back, as it's pretty clear compared to the workarounds needed...
Thank you!
Could someone from the pandas core team please explain the main reason for deprecating relabeling dicts in groupby.agg?
Did you see https://github.com/pandas-dev/pandas/pull/15931/files#diff-52364fb643114f3349390ad6bcf24d8fR461?
The primary reason was that dict-keys were overloaded to do two things. For Series / SeriesGroupBy, they're for naming. For DataFrame/DataFrameGroupBy, they're for selecting a column.
In [32]: mydf.aggregate({"distance": "min"})
Out[32]:
distance 0.6
dtype: float64
In [33]: mydf.aggregate({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[33]:
distance
foo 0.6
In [34]: mydf.distance.agg({"foo": "min"})
Out[34]:
foo 0.6
Name: distance, dtype: float64
In [35]: mydf.groupby("cat").agg({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4201: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[35]:
distance
foo
cat
A 1.20
B 0.82
C 0.60
In [36]: mydf.groupby("cat").distance.agg({"foo": "min"})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[36]:
foo
cat
A 1.20
B 0.82
C 0.60
This isn't the most confusing thing in pandas probably, so perhaps we could revisit it :) I'm presumably missing some edge cases. But even if we do remove dict-of-dicts aggregations, we still have the inconsistency between naming and column selection:
For Series / SeriesGroupBy the dictionary keys are always for naming the output.
For DataFrame / DataFrameGroupby, the dict keys are always for selection. With dict-of-dicts we select a column, and then the inner dict is for naming the output, just like Series / SeriesGroupBy.
We discussed this briefly before (somewhere in the long discussion about the deprecation), and I proposed something similar here: https://github.com/pandas-dev/pandas/pull/14668#issuecomment-274508089. But in the end only the deprecation was implemented, not the ideas for making the other functionality of using dicts (the 'renaming' function) easier.
The problem was that dicts were used both for 'selection' (on which column do you want this function to be applied) and 'renaming' (what should be the resulting column name when applying this function). An alternative syntax, apart from dicts, could be keyword arguments, as discussed here in the `agg_assign` proposal. I am still in favor of exploring this possibility, whether in `agg` itself or in a new method like `agg_assign`.
What I proposed back then was something similar to `agg_assign`, but using a dict per keyword instead of a lambda function. Translated to the example here, this would be something like:
mydf.groupby('cat').agg(
    energy_sum={'energy': 'sum'},
    energy_p98={'energy': lambda x: np.percentile(x, 98)},
    energy_p17={'energy': lambda x: np.percentile(x, 17)},
    distance_sum={'distance': 'sum'},
    distance_mean={'distance': 'mean'},
    distance_mad={'distance': smrb.mad},
    distance_mad_c1={'distance': mad_c1})
I am not sure this is necessarily more readable or easier to write than the version with all lambdas, but this one could potentially be more performant, as pandas can still use the optimized implementations for sum, mean, etc. on those columns where you do not have a lambda or user-specified function.
A big question with this approach is what `df.groupby('cat').agg(foo='mean')` would mean. That would logically apply 'mean' to all columns, since you didn't make any selection (similar to `{'col1': {'foo': 'mean'}, 'col2': {'foo': 'mean'}, 'col3': ...}` before). But that would result in multi-indexed columns, while in the example above I think it would be nice to not end up with MI columns.
I think the above can be done backwards-compatibly inside the existing `agg`, but the question is whether this is needed.
I also think this would nicely extend to the Series case, like:
mydf.groupby('cat').distance.agg(
    distance_sum='sum',
    distance_mean='mean',
    distance_mad=smrb.mad,
    distance_mad_c1=mad_c1)
(and you could even consider doing the above once for 'distance' and once for 'energy' and concatenating the results, if you don't like all the dicts/lambdas; see the sketch below)
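For illustration, a rough sketch of that per-column-then-concat idea, using only API that exists today (the output names here are just illustrative):

gb = mydf.groupby('cat')

# Aggregate each column separately, renaming the generic 'sum'/'mean'
# columns right away, then glue the pieces together along axis=1.
distance_part = gb.distance.agg(['sum', 'mean']).rename(
    columns={'sum': 'distance_sum', 'mean': 'distance_mean'})
energy_part = gb.energy.agg(['sum', 'mean']).rename(
    columns={'sum': 'energy_sum', 'mean': 'energy_mean'})

result = pd.concat([distance_part, energy_part], axis=1)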
@TomAugspurger In your simple example implementation of `agg_table`, wouldn't it be better to iterate over the different functions to be applied instead of iterating over the groups, and in the end concatenate the new columns by axis=1 instead of concatenating the newly formed rows by axis=0?
BTW, @zertrin @tdpetrou @smcateer @pirsquared and others, thanks a lot for raising this issue and giving such detailed feedback. Such feedback and community involvement is very important!
I actually really like the pattern suggested by @tdpetrou (using apply with a function that returns a Series) - probably even better than the dict of dicts.
If the function returns `pd.Series(data, index=data.keys())`, are we guaranteed to get the indices in the right order? (Just thinking about how best to implement the pattern in my code, at the risk of drifting off-topic.)
Edit: sorry, I misunderstood the point of the index argument (it is optional here, only needed if you want to specify the order of the columns; returning `pd.Series(data)` does the job for me).
Would @tdpetrou's example work with `first` & `last` aggregations?
I had to resort to head/tail like this:
def agg_funcs(x):
    data = {'start': x['DATE_TIME'].head(1).values[0],
            'finish': x['DATE_TIME'].tail(1).values[0],
            'events': len(x['DATE_TIME'])}
    return pd.Series(data, index=list(data.keys()))

results = df.groupby('col').apply(agg_funcs)
I'd still like to address this, but I don't think it'll be done for 0.23.
Could @tdpetrou's approach work without defining a function that we will never use again in our code? Coming from a Q/Kdb+ world (similar to SQL), I am confused why we need to create any temporary variable/function for a simple select statement.
OP here.
Honestly, after all this time and the plenty of discussion in #15931 and here, I am still not convinced that deprecating relabeling dicts is a good idea.
In the end, none of the alternatives proposed here is more intuitive to users than the current relabeling-dict approach, IMHO. When it was in the documentation, a single example made it clear how this works, and it is very flexible.
Of course, pandas developers may still think otherwise; I'm just chiming in with the point of view of a user.
Even the relabeling-dict approach is not very intuitive. In my opinion the syntax should be similar to SQL: `func(column_name) as new_column_name`. In Python we can do this with a three-item tuple: `(func, column_name, new_column_name)`. This is how dexplo does groupby aggregation.
@zertrin do you have feedback on my proposal above (https://github.com/pandas-dev/pandas/issues/18366/#issuecomment-349089667)? In the end, it kind of inverts the order of the dict: instead of `{col: {name: func}}` it would be kind of `**{name: {col: func}}`.
@jorisvandenbossche I have considered your approach. The thing is, I don't really see what additional advantages it brings over the current approach.
To put it more bluntly: given the choice between (1) keeping the already-implemented dict-of-dict approach and (2) implementing the newly proposed one, I don't see why we should choose 2 unless it brings meaningful and tangible advantages from a developer and user perspective.
To address some of the points in your proposal above:

The problem was that dicts were both used for 'selection' (on which column do you want this function to be applied) and 'renaming' (what should be the resulting column name when applying this function).

Since it was nicely documented before, I don't believe it was an issue for users. Personally, I got the point immediately by looking at the examples in the documentation. (EDIT: and I thought: "yay! very useful construct, it exactly matches what I was looking for. Nice.")

An alternative syntax, apart from dicts, could be keyword arguments

One of the attractive things about the dict-of-dict approach is that users can easily generate it dynamically with some other code. As you pointed out in the comment just above this one, moving to keyword arguments as in your proposition would still allow for this via the `**{name: {col: func}}` construct. So I'm not against your proposal. I just don't see the added value or the necessity of such a change when we already achieve the same level of functionality with the currently implemented system.
In the end, your proposal would be okay if pandas core devs have a strong feeling against the current approach. I just don't see any benefit as a user (in fact, I see the drawback of having to change all existing user code to make it work again with the new proposition).
@zertrin we discussed this yesterday with some core devs but didn't get to writing the summary here, so I am going to do that now, before answering your comment, to reflect only our thoughts of yesterday.
First, to state: the notion that a basic functionality like SQL's "SELECT avg(col2) as col2_avg" should work and be easy is something we completely agree on, and we really want a solution for this.
Apart from the original reasons we decided to deprecate this (which may or may not be that strong), the current (deprecated) dict of dicts is also not ideal, as it creates a MultiIndex that you actually never want:
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': range(3), 'C': [.1, .2, .3]})
In [3]: gr = df.groupby('A')
In [4]: gr.agg({'B': {'b_sum': 'sum'}, 'C': {'c_mean': 'mean', 'c_count': 'count'}})
Out[4]:
C B
c_count c_mean b_sum
A
a 2 0.2 2
b 1 0.2 1
In the above, the first level of the MultiIndex is superfluous, as you already specifically renamed the columns (in the example in the OP, this is also directly followed by dropping the first level of the columns).
It is, however, hard to change this, because you can also do things like `gr.agg(['sum', 'mean'])` or (mixed) `gr.agg({'B': ['sum', 'mean'], 'C': {'c_mean': 'mean', 'c_count': 'count'}})` where the MultiIndex is needed and makes sense.
So one of the proposals mentioned in the discussion above was to have a way to specify the final column names separately (e.g. https://github.com/pandas-dev/pandas/issues/18366#issuecomment-346683449).
Adding e.g. an extra keyword to `aggregate` to specify the column names, like

gr.agg({'B': 'sum', 'C': ['mean', 'count']}, columns=['b_sum', 'c_mean', 'c_count'])

would be possible. However, if we split the column/function specification from the new column names like that, we can also make this more generic than a new keyword and do something like:

gr.agg({'B': 'sum', 'C': ['mean', 'count']}).rename(columns=['b_sum', 'c_mean', 'c_count'])

This needs https://github.com/pandas-dev/pandas/issues/14829 to be solved (something we want to do for 0.24.0). (Important note: for this we do need to fix the duplicate-names problem of lambda functions, so we should do some kind of automatic deduplication of the names if we want to support this solution.)
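For reference, a sketch of what already works today without #14829: assigning the flat list of names directly.

res = gr.agg({'B': 'sum', 'C': ['mean', 'count']})
res.columns = ['b_sum', 'c_mean', 'c_count']  # replaces the MultiIndex wholesale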
Then, we still like the keyword-argument way of renaming. Reasons for this are that it is consistent with how `assign` works in pandas, and also consistent with how `groupby().aggregate()` works in ibis (and similar to how it looks in e.g. dplyr in R).
We still had a bit of discussion about how it could look. What I proposed above was (using the equivalent column/function selection as in my first examples):
gr.agg(b_sum={'B': 'sum'}, c_mean={'C': 'mean'}, c_count={'C': 'count'})
You can still build up this specification as a dict of dicts, but with the inner and outer levels swapped compared to the current (deprecated) version:

gr.agg(**{'b_sum': {'B': 'sum'}, 'c_mean': {'C': 'mean'}, 'c_count': {'C': 'count'}})

(we could have an example helper function that converts the existing dicts of dicts to this version; see the sketch below)
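Such a helper could be a small dict comprehension; a hypothetical sketch:

def swap_nesting(old_spec):
    # convert the old {col: {name: func}} spec into the proposed {name: {col: func}}
    return {name: {col: func}
            for col, inner in old_spec.items()
            for name, func in inner.items()}

# gr.agg(**swap_nesting({'B': {'b_sum': 'sum'},
#                        'C': {'c_mean': 'mean', 'c_count': 'count'}}))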
However, the dict is always only a single `{col: func}`, and those multiple single-element dicts look a bit strange. So an alternative we thought of is to use tuples:

gr.agg(b_sum=('B', 'sum'), c_mean=('C', 'mean'), c_count=('C', 'count'))

This looks a bit better, but on the other hand the `{'B': 'sum'}` dict is consistent with the other APIs for specifying the column on which to apply the function.
Both suggestions above (the easier renaming afterwards, and the keyword-based naming) are in principle orthogonal, but it could be nice to have both (or still something else based on further discussion)
Thanks for forwarding the devs' current thoughts here 😃
I acknowledge the (in my opinion, only) drawback of the deprecated dict-of-dict approach: the resulting MultiIndex. It could be flattened if the user passes an additional option (yeah, YAO :-/ ).
As mentioned, I am not against the second version, as long as it stays possible to build the specification dynamically (thanks to the `**{}` construct, yay Python!).
As such, the last suggestion (with dicts or tuples for the col>func mapping) is okay I think.
The first proposition in the previous comment can be implemented if you really want to, but my feedback on this is that, as a user, I wouldn't choose it over the second alternative because of the pain of keeping things in sync between two lists.
Discussed at the dev meeting today.
Short summary:
- Accept `gr.agg(b_sum=("B", "sum"), ...)`, i.e. when there's no `arg` passed to `*GroupBy.agg`, interpret kwargs as `<output_name>=(<selection>, <aggfunc>)`
- Add a `flatten=True` keyword to `.agg`
Maybe this helps: my workaround for the deprecation is these helper functions, which replace the alias->aggr maps with lists of correctly named functions:
def aliased_aggr(aggr, name):
    if isinstance(aggr, str):
        def f(data):
            return data.agg(aggr)
    else:
        def f(data):
            return aggr(data)
    f.__name__ = name
    return f

def convert_aggr_spec(aggr_spec):
    return {
        col: [
            aliased_aggr(aggr, alias) for alias, aggr in aggr_map.items()
        ]
        for col, aggr_map in aggr_spec.items()
    }
which gives the old behaviour with:

mydf_agg = mydf.groupby('cat').agg(convert_aggr_spec({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,  # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
}))
which is the same as

mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        aliased_aggr('sum', 'total_energy'),
        aliased_aggr(lambda x: np.percentile(x, 98), 'energy_p98'),
        aliased_aggr(lambda x: np.percentile(x, 17), 'energy_p17'),
    ],
    'distance': [
        aliased_aggr('sum', 'total_distance'),
        aliased_aggr('mean', 'average_distance'),
        aliased_aggr(smrb.mad, 'distance_mad'),
        aliased_aggr(mad_c1, 'distance_mad_c1'),
    ]
})
This works for me, but probably won't work in some corner cases...
Update: I found out that renaming is not necessary, as tuples in an aggregation specification are interpreted as (alias, aggr). So the `aliased_aggr` function is not needed, and the conversion becomes:
def convert_aggr_spec(aggr_spec):
    return {
        col: [
            (alias, aggr) for alias, aggr in aggr_map.items()
        ]
        for col, aggr_map in aggr_spec.items()
    }
I just want to chime in here as yet another user who is really, really missing the functionality of aggregating a column on any function and immediately renaming it in the same row. I have never found myself using the MultiIndex returned by pandas - I either flatten it immediately, or I actually want to manually specify my column names because they actually mean something specific.
I would be happy with any of the approaches proposed here: SQL-like syntax ( I actually find myself using .query()
a lot in pandas already), reverting to the depreciated behavior, any of the other suggestions. The current approach already brought me ridicule from colleagues who use R.
I recently even found myself using PySpark instead of pandas even though it wasn't necessary, just because I like the syntax so much more:
df.groupby("whatever").agg(
F.max("col1").alias("my_max_col"),
F.avg("age_col").alias("average_age"),
F.sum("col2").alias("total_yearly_payments")
)
Also, PySpark is way more convoluted to write than pandas in most cases, yet this just looks so much cleaner! So I definitely appreciate that work on this is still being done :-)
I think we have an agreed syntax for this functionality; we need someone to implement it.
I'm going to try to get to this for 0.25.0
I've put up a PR at https://github.com/pandas-dev/pandas/pull/26399. The basic idea is to allow this mixture of rename & column-specific aggregation by using `**kwargs`, with the understanding that the values should be tuples of `(selection, aggfunc)`.
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
In [3]: df
Out[3]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
min_height max_weight
kind
cat 9.1 9.9
dog 6.0 198.0
This has a few limitations:
- It's somewhat peculiar to the rest of pandas. The syntax `(output_name=(selection, aggfunc))` doesn't really appear anywhere else (though `.assign` does use the `output_name=...` pattern).
- The spelling for output names that aren't Python identifiers is ugly: `.agg(**{'output name': (col, func)})`
- It's Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of `**kwargs` wasn't previously preserved.
- The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you'll need to partially apply it first.

And there's an implementation detail: multiple `lambda` aggfuncs for the same column isn't yet supported, though that can be fixed later.
I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?
cc @WillAyd if I missed any of your concerns.
Hi @TomAugspurger,
Thanks for making this move forward.
This has a few limitations

- It's somewhat peculiar to the rest of pandas. The syntax (output_name=(selection, aggfunc)) doesn't really appear anywhere else (though .assign does use the output_name=... pattern)
I can't help but feel that this kind of argument seems quite similar to the one that motivated deprecating the existing implementation in the first place.
Could you share why we benefit more from this new way over the old one with respect to that particular argument?
One benefit I can already think of is that (for py3.6+) we can select the output order of the columns individually.
- The spelling for output names that aren't python identifiers is ugly:
.agg(**{'output name': (col, func)})
Somehow, the old way was better in that respect. But as I said earlier, as long as it is possible to use the `**{...}` construct to build the aggregation dynamically, I would be happy enough.
- It's Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of **kwargs wasn't previously preserved
How did it work before, with the existing dict-of-dict feature? Was the order guaranteed in some way?
- The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you'll need to partially apply it first
Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right? (in addition to the "often used" string aggfuncs like `'min'`, `'max'`, etc.) Is there any difference compared to before? (i.e. wasn't the unary limitation already present?)
And there's an implementation detail, multiple lambda aggfuncs for the same column isn't yet supported, though that can be fixed later.
Yeah, that one is kind of annoying, but as long as it is just a temporary limitation that is open to being fixed, this could work.
I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?
Well, in any case I think that aggregate and rename in one step is really important to keep. If the old behavior is really not an option, then this alternative could do.
Could you share why we benefit more from this new way over the old one with respect to that particular argument?

I may be misremembering, but I believe SeriesGroupBy.agg and DataFrameGroupBy.agg assign different meanings to the outer key in a dictionary (is it a column selection or an output name?). With this syntax, we can consistently have the keyword mean the output name.
Somehow, the old way was better in that respect.
Is the difference just the `**`? Otherwise, I think the same limitations are shared.
How did it work before (existing dict-of-dict feature)? was the order guaranteed in some way?
Sorting the keys, which is what I'm doing over in the PR now.
Just to confirm my understanding: the aggfunc can be any callable that returns a valid value right?
Here's the difference
In [21]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})
In [22]: def aggfunc(x, myarg=None):
...: print(myarg)
...: return sum(x)
...:
In [23]: df.groupby("A").agg({'B': {'foo': aggfunc}}, myarg='bar')
/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py:1308: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super().aggregate(arg, *args, **kwargs)
None
Out[23]:
B
foo
A
a 3
with the alternative proposal, we're reserving `**kwargs` for output column names. So you would need `functools.partial(aggfunc, myarg='bar')`.
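For instance, with the `df` and `aggfunc` defined just above, that would look something like:

from functools import partial

# Bind the extra argument up front, then hand the unary callable to agg
df.groupby("A").agg(foo=('B', partial(aggfunc, myarg='bar')))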
Ok thanks, I think the proposed approach is 👍 for a first iteration (and will really be OK as a replacement once the multiple-lambda limitation is removed)
This issue was created based on the discussion from #15931, following the deprecation of relabeling dicts in `groupby.agg`. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085, where the problems are also clearly stated.
The motivation behind the deprecation in #15931 was mostly related to bringing a consistent interface for `agg()` between Series and DataFrame (see also #14668 for context).
The relabeling functionality with a nested dict has been described by some as being too complex and/or inconsistent, and was thus deprecated.
However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility, where no sensible workaround is available:
- lambda functions all produce columns named `<lambda>`, so two different lambdas can no longer be applied to the same column
- wrapped functions such as partials need a proper `__name__` attribute to be usable at all
- there is no direct control over the resulting column names
Example
(please note, this is a crafted example meant to demonstrate the problem in as little code as possible, but all of the issues demonstrated here did bite me in real life since the change, and in situations not as simple as here)
Input DataFrame
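The original code block was not preserved here; this is the same frame @pirsquared reproduced above as "the DataFrame offered up above":

import numpy as np
import pandas as pd
import statsmodels.robust as smrb
from functools import partial

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

mad_c1 = partial(smrb.mad, c=1)  # assumed definition of the partial used below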
Before:
easy to write and read, and works as expected:
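This is the dict-of-dict call as quoted by @pirsquared above ("identical dictionary passed to agg"):

mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': lambda x: np.percentile(x, 17),
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,
        'distance_mad_c1': mad_c1,
    },
})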
results in a frame with MultiIndex columns (the original column names on the first level, the chosen new names on the second), and all that is left is to drop the superfluous first level:
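A one-line sketch of that clean-up step:

mydf_agg.columns = mydf_agg.columns.droplevel(0)  # drop the superfluous first level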
Happy dance praising pandas 💃 🕺 !
After
The above breaks because the lambda functions will all result in columns named `<lambda>`, whose names collide.
Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.
If one removes the `lambda x: np.percentile(x, 98)` from above, we get the same issue with the partial function, which inherits the `__name__` of the function it wraps.
Finally, after overwriting the `__name__` attribute of the partial (for example with `mad_c1.__name__ = 'mad_c1'`) we get a result, with the column renaming still left to deal with in a separate step.
There is no control possible over the column names after aggregation; the best we can get in an automated way is some combination of the original column name and the aggregate function's name, like this:
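A sketch of that automated flattening (assuming the `agg_dict`-style spec from earlier in the thread, and that every aggregating function carries a unique `__name__`):

mydf_agg = mydf.groupby('cat').agg(agg_dict)
# join the two MultiIndex levels into '<column>_<function>' names
mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]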
which results in flattened column names such as `energy_sum` or `distance_mad_c1`,
and if you really need to have different names, you can do it like this:
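Again a sketch; the target names here simply mirror the renaming used earlier in the thread:

mydf_agg = mydf_agg.rename(columns={
    'energy_sum': 'total_energy',
    'distance_sum': 'total_distance',
    'distance_mean': 'average_distance',
})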
but that means that you need to be careful to keep the renaming code (which must now be located at another place in the code) in sync with the code where the aggregation is defined...
Sad pandas user 😢 (who still loves pandas, of course)
I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.
Possible solutions
Optional read:
With respect to the aforementioned discussion in the pull request, which has been going on for a few months already, I only recently realized one of the reasons why I am so bothered by this deprecation: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL, since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.
I'm not saying that pandas should necessarily provide the same functionalities as SQL, of course. But the examples provided above demonstrate why the dict-of-dict API was, in my opinion, a clean and simple solution to many use-cases.
(* I don't personally agree that the dict-of-dict approach is complex.)