pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

Deprecation of relabeling dicts in groupby.agg brings many issues #18366

Closed zertrin closed 5 years ago

zertrin commented 6 years ago

This issue is created based on the discussion from #15931 following the deprecation of relabeling dicts in groupby.agg. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085 where the problems are also clearly stated.

The motivation behind the deprecation of #15931 was mostly related to bringing a consistent interface for agg() between Series and Dataframe (see also #14668 for context).

The relabeling functionality with a nested dict has been described by some as being too complex and/or inconsistent and thus deprecated.

However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility where no sensible workaround is available:

Example

(please note, this is a crafted example for the purpose of demonstrating the problem in as short a code as possible, but all of the demonstrated issues here did bite me in real life since the change, and in situations not as simple as here)

Input Dataframe

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)
  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Before:

easy to write and read, and works as expected

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# renaming and specifying the aggregators at the same time
# note that I want to choose the resulting column names myself
# for example "total_xxxx" instead of just "sum"
mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Happy dance praising pandas 💃 🕺 !

After

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# no way of choosing the destination's column names...
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98), # lambda
        lambda x: np.percentile(x, 17), # lambda
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad, # original function
        mad_c1,   # partial function wrapping the original function
    ],
})

The above breaks because the lambda functions will all result in columns named <lambda> which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function which inherits the function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (for example with mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

with still

to deal with in separate step.

There is no control possible for the column names after aggregation, the best we can get in an automated way is some combination of original column name and the aggregate function's name like this:

mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]

which results in:

     energy_sum  energy_<lambda>  distance_sum  distance_mean  distance_mad distance_mad_c1
cat
A          5.79           1.8510          4.44          1.480      0.355825           0.240
B          2.85           1.3095          1.83          0.915      0.140847           0.095
C          1.01           1.0100          0.60          0.600      0.000000           0.000

and if you really need to have different names, you can do it like this:

mydf_agg.rename({
    "energy_sum": "total_energy",
    "energy_<lambda>": "energy_p17",
    "distance_sum": "total_distance",
    "distance_mean": "average_distance"
    }, inplace=True)

but that means that you need to be careful to keep the renaming code (which must now be located at another place in the code) in sync with the code where the aggregation is defined...

Sad pandas user 😢 (which still loves pandas of course)


I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.


Possible solutions


Optional read:

With respect to the aforementioned discussion in the pull request which has been going on already for a few months, I only recently realized one of the reasons why I am so bothered by this deprecation: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.

I'm not saying that Pandas should necessarily provide the same functionalities as SQL of course. But the examples provided above demonstrate why the dict-of-dict API was in my opinion a clean and simple solution to many use-cases.

(* I don't personally agree that the dict-of-dict approach is complex.)

gfyoung commented 6 years ago

@zertrin : Thanks for putting this together. I saw that there was a lot of discussion back in #15931 about this. As I haven't been able to read this in full, I cannot comment at the moment. Nevertheless, let me ping:

@jreback @jorisvandenbossche @TomAugspurger @chris-b1

tdpetrou commented 6 years ago

I agree that renaming with the current agg implementation is very clunky and broken in this example. The nested dicts are somewhat complex but writing them as you did makes it very clear what is happening.

I suppose there could be a names parameter added to agg which would take dictionary mapping the aggregating columns to their new names. You could even add another parameter drop_index as a boolean to determine whether to keep the upper index level.

So the syntax would turn into:

agg_dict = {'energy': ['sum',
                       lambda x: np.percentile(x, 98), # lambda
                       lambda x: np.percentile(x, 17), # lambda
                      ],
            'distance': ['sum',
                         'mean',
                         smrb.mad, # original function
                         mad_c1,   # partial function wrapping the original function
                        ]
           }

name_dict = {'energy':['energy_sum', 'energy_p98', 'energy_p17'],
             'distance':['distance_sum', 'distance_mean', 'distance_mad', 'distance_mad_c1']}

mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)

Or maybe, an entire new method agg_assign could be created, which would work similarly to DataFrame.assign:

mydf.groupby('cat').agg_assign(energy_sum=lambda x: x.energy.sum(),
                               energy_p98=lambda x: np.percentile(x.energy, 98),
                               energy_p17=lambda x: np.percentile(x.energy, 17),
                               distance_sum=lambda x: x.distance.sum(),
                               distance_mean=lambda x: x.distance.mean(),
                               distance_mad=lambda x: smrb.mad(x.distance),
                               distance_mad_c1=lambda x: mad_c1(x.distance))

I actually like this option much better.

smcateer commented 6 years ago

For what it's worth, I am also strongly in favour of not depreciating the functionality.

A big reason for me is that there is something deeply queer about mixing the Python's function name-space (something to do with the particular implementation) with the data the column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named '<lambda>' is causes me severe cognitive dissonance.

The renaming approach grates, because there is this intermediary step where unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to reliably, systematically rename because there are potentially dependencies on the implementation.

Aside from that, the nested dict functionality is admittedly complex, but it is a complex operation that is being performed.

TL;DR Please don't depreciate. :)

pirsquared commented 6 years ago

My contribution is motivated by two things.

  1. I'm aware of and agree with the motivation to reduce the bloated API of Pandas. Even if I'm misguided in regards to the perceived motivation to reduce "bloated" API elements, It is still my opinion that Pandas' API could be streamlined.
  2. I think it is better to have a good cookbook with good recipes than provide API to satisfy everyone's wants and desires. I'm not claiming that the renaming via nested dictionaries qualifies as satisfying whims as it already existed and we are discussing it's deprecation. But it does lie on the spectrum between streamlined API and something... else.

Also, the Pandas Series and DataFrame objects have had pipe methods to facilitate pipelining. In this doc segment it is discussed that we could use pipe to proxy for methods in lieu of subclassing. In the same spirit, we could use the new GroupBy.pipe to perform a similar role and allow us to build proxy methods for groupby objects.

I'll use @zertrin 's example

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# The DataFrame offered up above
mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

# Identical dictionary passed to `agg`
funcs = {
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
}

# Write a proxy method to be passed to `pipe`
def agg_assign(gb, fdict):
    data = {
        (cl, nm): gb[cl].agg(fn)
        for cl, d in fdict.items()
        for nm, fn in d.items()
    }
    return pd.DataFrame(data)

# All the API we need already exists with `pipe`
mydf.groupby('cat').pipe(agg_assign, fdict=funcs)

Which results in

            distance                                                 energy                        
    average_distance distance_mad distance_mad_c1 total_distance energy_p17 energy_p98 total_energy
cat                                                                                                
A              1.480     0.355825           0.240           4.44     1.8510     2.0364         5.79
B              0.915     0.140847           0.095           1.83     1.3095     1.5930         2.85
C              0.600     0.000000           0.000           0.60     1.0100     1.0100         1.01

The pipe method makes adding new API unnecessary in many cases. It also, provides the means for a replacement for the deprecated functionality we are discussing. Therefore, I'd be inclined to go forward with the deprecation.

maxu777 commented 6 years ago

I really like tdpetrou's idea - to use: names=name_dict.

This can make everyone happy. It gives us a possibility to rename columns easily as we wish.

zertrin commented 6 years ago

Not really, as mentioned in my initial post, this would not solve the issue of decoupling the place where the aggregate operation is defined from the name of the resulting column, requiring an extra effort to make sure both are "synchronized".

I don't say that's a bad solution (after all it solves the other issues), but it wouldn't be as easy and clear as the dict of dict approach. I mean here that at writing time you need to keep both dicts of lists synchronized, and when reading the source, the reader must make an effort to match the names in the second dict of lists with the aggregate definition in the first dict of lists. That's twice the effort in each case.

The nested dicts are somewhat complex but writing them as you did makes it very clear what is happening.

I still don't understand why everyone seems to say that dict of dict is complex. To me that's the clearest way of doing it.

That said, if the names keyword is the only solution the pandas team is comfortable with, that would still be an improvement over current situation.

zertrin commented 6 years ago

@pirsquared interesting solution with current API. Albeit in my opinion not quite easy to grasp (i don't really understand how it works :confused: )

tdpetrou commented 6 years ago

I started a thread on the datascience subreddit - What do you hate about pandas?. Someone brought up their contempt for the returned MultiIndex after a groupby and pointed to the dplyr do verb which is implemented in plydata. It happens to work exactly as agg_assign so that was quite interesting.

@zertrin agg_assign would be superior to your dict of dict approach and be identical to sql aggregations as well as allow for multiple columns to interact with one another within the aggregation. It would also work identically to DataFrame.assign.

Any thoughts @jreback @TomAugspurger ?

has2k1 commented 6 years ago

... mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)

Although this solves the problem, one needs to align keys and values in two places. I think an API (as suggested for .agg_assign) that does not require such book-keeping code is less error prone.

There is also the issue of the clean up code after using the API. When groupby operations return a MultiIndex dataframe, in most cases the user undoes the MultiIndex. The straight forward declarative way of using .agg_assign, suggests no hierarchy, no MultiIndex output, no clean up afterwards.

Based on usage patterns, I think multi-index outputs should be strictly opt-in and not opt-out.

zertrin commented 6 years ago

I was initially skeptical about the agg_assign proposition, but the last two comments have convinced me that this could be a good solution.

Especially thinking about the possibility to use it in the form agg_assign(**relabeling_dict) and thus be able to define my relabeling_dict like this:

relabeling_dict = {
    'energy_sum': lambda x: x.energy.sum(),
    'energy_p98': lambda x: np.percentile(x.energy, 98),
    'energy_p17': lambda x: np.percentile(x.energy, 17),
    'distance_sum': lambda x: x.distance.sum(),
    'distance_mean': lambda x: x.distance.mean(),
    'distance_mad': lambda x: smrb.mad(x.distance),
    'distance_mad_c1': lambda x: mad_c1(x.distance)
}

That would be quite flexible and solve all the issues mentioned in my OP.

tdpetrou commented 6 years ago

@zertrin @has2k1

I was thinking about this a bit more and this functionality exists already with apply. You simply return a Series with index as the new column names and values as the aggregation. This allows for spaces in the name and gives you the ability to order columns exactly how you wish:

def my_agg(x):
    data = {'energy_sum': x.energy.sum(),
            'energy_p98': np.percentile(x.energy, 98),
            'energy_p17': np.percentile(x.energy, 17),
            'distance sum' : x.distance.sum(),
            'distance mean': x.distance.mean(),
            'distance MAD': smrb.mad(x.distance),
            'distance MAD C1': mad_c1(x.distance)}
    return pd.Series(data, index=list_of_column_order)

mydf.groupby('cat').apply(my_agg)

So, there might not be a need for a new method and instead just a better example in the docs.

has2k1 commented 6 years ago

@tdpetrou, you are correct. I had forgotten how apply works as I use my own version because of the double execution in fast-slow path selection process.

zertrin commented 6 years ago

Hum indeed, there is no chance I would have thought about using it in a aggregation context just by reading the doc however... Moreover, I still find the solution with apply a bit too convoluted. The agg_assign approach seemed more straightforward and understandable.

Since there was never really a statement about it, is the dict-of-dict approach (which, albeit currently deprecated, is already implemented and also solves all these issues) really definitely out of the question?

Except for the agg_assign approach, dict-of-dict still seems the most simple one, and doesn't need any coding, just un-deprecating.

TomAugspurger commented 6 years ago

The benefit and drawback of the agg_assign approach is that it pushes the column selection into the aggregation method. In all the examples, the x passed to the lambda is something like self.get_group(group) for each group in self, a DataFrameGroupBy object. This is nice because it cleanly separates the naming, which is in **kwargs, from the selection, which is in the function.

The drawback is that your nice, generic aggregation functions now have to be concerned with column selection. There's no free lunch! That means you'll end up with many helpers like lambda x: x[col].min. You'll also need to be careful with things like np.min which reduces over all dimensions, vs. pd.DataFrame.min, which reduces over axis=0. That's why something like agg_assign wouldn't be equivalent to apply. apply still operates column-wise for certain methods.

I'm not sure about these tradeoffs vs. the dict-of-dicts method, but I'm curious to hear other people's thoughts. Here's a rough sketch of agg_assign, which I've called which I've called agg_table to emphasize that the functions are being passed the tables, not columns:

from collections import defaultdict

import pandas as pd
import numpy as np
from pandas.core.groupby import DataFrameGroupBy

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

def agg_table(self, **kwargs):
    output = defaultdict(dict)
    for group in self.groups:
        for k, v in kwargs.items():
            output[k][group] = v(self.get_group(group))

    return pd.concat([pd.Series(output[k]) for k in output],
                     keys=list(output),
                     axis=1)

DataFrameGroupBy.agg_table = agg_table

Usage


>>> gr = mydf.groupby("cat")
>>> gr.agg_table(n=len,
                 foo=lambda x: x.energy.min(),
                 bar=lambda y: y.distance.min())

   n   foo   bar
A  3  1.80  1.20
B  2  1.25  0.82
C  1  1.01  0.60

I suspect we could do a bit to make the performance of this less awful, but not nearly as much as .agg does...

maxu777 commented 6 years ago

Could someone from the Pandas Core Team please explain what is the main reason for deprecating of relabeling dicts in groupby.agg?

I could easily understand if it causes too much problems to maintain the code, but if it's about complexity for the end user - i would also opt for bringing it back, as it's pretty clear compared to needed workarounds...

Thank you!

TomAugspurger commented 6 years ago

Could someone from the Pandas Core Team please explain what is the main reason for deprecating of relabeling dicts in groupby.agg?

Did you see https://github.com/pandas-dev/pandas/pull/15931/files#diff-52364fb643114f3349390ad6bcf24d8fR461?

The primary reason was that dict-keys were overloaded to do two things. For Series / SeriesGroupBy, they're for naming. For DataFrame/DataFrameGroupBy, they're for selecting a column.

In [32]: mydf.aggregate({"distance": "min"})
Out[32]:
distance    0.6
dtype: float64

In [33]: mydf.aggregate({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[33]:
     distance
foo       0.6

In [34]: mydf.distance.agg({"foo": "min"})
Out[34]:
foo    0.6
Name: distance, dtype: float64

In [35]: mydf.groupby("cat").agg({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4201: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[35]:
    distance
         foo
cat
A       1.20
B       0.82
C       0.60

In [36]: mydf.groupby("cat").distance.agg({"foo": "min"})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[36]:
      foo
cat
A    1.20
B    0.82
C    0.60

This isn't the most confusing thing in pandas probably, so perhaps we could revisit it :) I'm presumably missing some edge cases. But even if we do remove dict-of-dicts aggregations, we still have the inconsistency between naming and column selection:

For Series / SeriesGroupBy the dictionary keys are always for naming the output.

For DataFrame / DataFrameGroupby, the dict keys are always for selection. With dict-of-dicts we select a column, and then the inner dict is for naming the output, just like Series / SeriesGroupBy.

jorisvandenbossche commented 6 years ago

We discussed this briefly before (somewhere in the long discussion about the deprecation), and I proposed something similar here: https://github.com/pandas-dev/pandas/pull/14668#issuecomment-274508089. But in the end only the deprecation was implemented, and not the ideas for making the other functionality of using dicts (the 'renaming' function) easier.

The problem was that dicts were both used for 'selection' (on which column do you want this function to be applied) and 'renaming' (what should be the resulting column name when applying this function). An alternative syntax, apart from dicts, could be keyword arguments, as is discussed here in the agg_assign proposal. I am still in favor to explore this possibility, whether it is in agg itself or in a new method like agg_assign.

What I proposed back then was something similar to agg_assign but using a dict per keyword instead of a lambda function. Translated to the example here, this would be something like:

mydf.groupby('cat').agg(
    energy_sum={'energy': 'sum'},
    energy_p98={'energy': lambda x: np.percentile(x, 98)},
    energy_p17={'energy': lambda x: np.percentile(x, 17)},
    distance_sum={'distance': 'sum'},
    distance_mean={'distance': 'mean'},
    distance_mad={'distance': smrb.mad},
    distance_mad_c1={'distance': mad_c1})

I am not sure this is necessarily more readable or easier to write as the version with all lambdas, but, this one could potentially be more performant, as pandas can still use the optimized implementations for sum, mean, etc on those columns where you do not have a lambda or user specified function.

A big question with this approach would be what df.groupby('cat').agg(foo='mean') would mean? That would logically apply 'mean' to all columns since you didn't make any selection (similar to {'col1' : {'foo': 'mean'}, 'col2': {'foo':'mean'}, 'col3': ...} before). But, that would result in multi-indexed columns, while in the example above I think it would be nice to not end up with MI columns.

I think the above can be done backwards compatible inside the existing agg, but the question is whether this is needed. I also think this would nicely extend to the series case like:

mydf.groupby('cat').distance.agg(
    distance_sum='sum',
    distance_mean='mean',
    distance_mad=smrb.mad,
    distance_mad_c1=mad_c1)

(and you could even consider to do the above one time for 'distance' and once for 'energy' and concat the result if you don't like all the dicts / lambda's)

@TomAugspurger In your example simple implementation of agg_table, wouldn't it be better to iterate over the different functions to be applied, instead of iterating of the groups, and in the end concatting the new columns by axis=1 instead of concatting the newly formed rows by axis=0 ?

jorisvandenbossche commented 6 years ago

BTW, @zertrin @tdpetrou @smcateer @pirsquared and others, thanks a lot for raising this issue and giving such detailed feedback. Such feedback and community involvement is very important!

smcateer commented 6 years ago

I actually really like the pattern suggested by @tdpetrou (using apply with a function that returns a Series) - probably even better than the dict of dicts.

If the function returns pd.Series(data, index=data.keys()) are we guaranteed to get the indices in the right order? (Just thinking about how best to implement the pattern in my code - at the risk of drifting off-topic).

Edit: sorry, I misunderstood the point of the index argument (it is optional here, only needed if you want to specify the order of the columns - returning pd.Series(data) does the job for me).

reesehopkins commented 6 years ago

Would @tdpetrou's example work with first & last aggregations?

I had to resort to head/tail like this

def agg_funcs(x):
    data = {'start':x['DATE_TIME'].head(1).values[0],
           'finish':x['DATE_TIME'].tail(1).values[0],
           'events':len(x['DATE_TIME'])}
    return pd.Series(data, index = list(data.keys()))

results = df.groupby('col').apply(agg_funcs)
TomAugspurger commented 6 years ago

I'd still like to address this, but I don't think it'll be done for 0.23.

BodonFerenc commented 6 years ago

Could @tdpetrou's approach work without defining a function that we will never use again in our code? Coming from a Q/Kdb+ world (similar to SQL) I am confused why we need to create any temporal variable/function for a simple select statement.

zertrin commented 6 years ago

OP here.

Honestly, after all this time and the plenty of discussion in #15931 and here, I am still not convinced that this is a good idea to deprecate relabeling dicts.

In the end, none of the alternatives proposed here are more intuitive to the users than the current relabeling dict approach IMHO. When it was in the documentation, just with one example it was clear how this works, and it is very flexible.

Of course pandas developers may still think otherwise, just chiming in with the point of view of an user.

tdpetrou commented 6 years ago

Even the relabeling dict approach is not very intuitive. In my opinion the syntax should be similar to SQL - func(column_name) as new_column_name. In Python we can do this with a three-item tuple. (func, column_name, new_column_name). This is how dexplo does groupby aggregation.

dexplo

jorisvandenbossche commented 6 years ago

@zertrin do you have feedback on my proposal above: https://github.com/pandas-dev/pandas/issues/18366/#issuecomment-349089667 In the end, it kind of inverts the order of the dict: instead of "{col: {name: func}}" it would be kind of "**{name: {col: func}}"

zertrin commented 6 years ago

@jorisvandenbossche I have considered your approach. The thing is, I don't really see what additional advantages it brings over the current approach.

To put it more bluntly, given the following choices:

  1. Undeprecate current behaviour which works well (a few lines of deprecation code to remove, re-add the piece of documentation that was removed)
  2. Implement your proposal (significant changes to be made in the code, pursue with deprecation of current approach, necessity for all users to adapt their code)

I don't see why we should choose 2 unless it brings meaningful and tangible advantages from a developer and user perspective.

To address some of the points in your proposal above:

The problem was that dicts were both used for 'selection' (on which column do you want this function to be applied) and 'renaming' (what should be the resulting column name when applying this function).

Since it was nicely documented before, I don't believe it was an issue for users. Personally, I got the point immediately looking at the examples in the documentation. (EDIT: and I thought: "yay! very useful construct, it exactly matches what I was looking for. Nice.")

An alternative syntax, apart from dicts, could be keyword arguments

One of the attractive thing to use the dict-of-dict approach is that users can be easily generate it dynamically with some other code. As you pointed out in the comment just above this one, moving to keyword arguments as in your proposition would still allow for this via the **{name: {col: func}} construct. So I'm not against your proposal. I just don't see the value added and the necessity of such changes when we already achieve the same level of functionality with the current implemented system.

In the end, you proposal would be okay if pandas core dev have a strong feeling against the current approach. I just don't see any benefit as a user. (in fact I see the drawback of changing all existing user code to make it work again with the new proposition).

jorisvandenbossche commented 6 years ago

@zertrin we discussed this yesterday with some core devs, but didn't get to writing the summary here. So now I am going to do that, before answering to your comment, to only reflect our thoughts of yesterday.


So first to state, the notion that a basic functionality like the SQL "SELECT avg(col2) as col2_avg" should work and be easy, is something we completely agree on, and we really want to have a solution for this.

Apart from the original reasons we decided to deprecate this (which may or may not be that strong), the current (deprecated) dicts of dicts is also not that ideal, as this creates a MultiIndex that you actually never want:

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': range(3), 'C': [.1, .2, .3]})

In [3]: gr = df.groupby('A')

In [4]: gr.agg({'B': {'b_sum': 'sum'}, 'C': {'c_mean': 'mean', 'c_count': 'count'}})
Out[4]: 
        C            B
  c_count c_mean b_sum
A                     
a       2    0.2     2
b       1    0.2     1

In the above, the first level of the MultiIndex is superfluous, as you already specifically renamed the columns (in the example in the OP, this is also directly followed by dropping the first level of the columns). It is however hard to change this because you can also do things like gr.agg(['sum', 'mean']) or (mixed) gr.agg({'B': ['sum', 'mean'], 'C': {'c_mean': 'mean', 'c_count': 'count'}}) were the MultiIndex is needed and makes sense.

So one of the proposals that was mentioned in the discussion above, was to have a way to specify the final column names separately (eg https://github.com/pandas-dev/pandas/issues/18366#issuecomment-346683449). Adding eg an extra keyword to aggregate to specify the column names, like

gr.agg({'B': 'sum', 'C': ['mean', 'count']}, columns=['b_sum', 'c_mean', 'c_count'])

would be possible. However, if we split the column/function specification and the new column names, we can also make this more generic than a new keyword, and do something like:

gr.agg({'B': 'sum', 'C': ['mean', 'count']}).rename(columns=['b_sum', 'c_mean', 'c_count'])

This needs https://github.com/pandas-dev/pandas/issues/14829 to be solved (something we want to do for 0.24.0). (important note: for this we do need to fix the duplicate names problem of lambda functions, so we should do some kind of automatic deduplication of the names if we want to support this solution.)


Then, we still like the way of the keyword arguments for renaming. Reasons for this are:

We still had a bit of discussion how it could look like. What I proposed above was (to use the equivalent column/function selection as in my first examples):

gr.agg(b_sum={'B': 'sum'}, c_mean={'C': 'mean'}, c_count={'C': 'count'})

You still can build up this specification as a dict of dicts, but with the inner and outer level swapped compared to the current (deprecated) version:

gr.agg(**{'b_sum': {'B': 'sum'}, 'c_mean': {'C': 'mean'}, 'c_count': {'C': 'count'})

(we could have a example helper function that converts the existing dicts of dicts to this version)

However, the dict is always only a single {col: func}, and those multiple single element dicts look a bit strange. So an alternative we thought of is to use tuples:

gr.agg(b_sum=('B', 'sum'), c_mean=('C', 'mean'), c_count=('C', 'count'))

This looks a bit better, but on the other hand the {'B': 'sum'} dict is consistent with the other APIs for specifying the column on which to apply the function.


Both suggestions above (the easier renaming afterwards, and the keyword-based naming) are in principle orthogonal, but it could be nice to have both (or still something else based on further discussion)

zertrin commented 6 years ago

Thanks for forwarding here the current thoughts from the devs 😃

I acknowledge the (, in my opinion, only) drawback of the deprecated dict-of-dict approach with the resulting MultiIndex. Could be flattened if the user pass an additional option (yeah YAO :-/ ).

As mentioned, I am not against the second version, as long as it stays possible to:

As such, the last suggestion (with dicts or tuples for the col>func mapping) is okay I think.

The first proposition in the previous comment can be implemented if you really want to, but my feedback on this is that, as a user, I wouldn't choose to use it over the second alternative because of the pain of keeping things in sync between two lists.

TomAugspurger commented 6 years ago

Discussed at the dev meeting today.

Short summary

  1. @jorisvandenbossche will try to implement gr.agg(b_sum=("B", "sum), ...), i.e. when there's no arg passed to *GroupBy.agg, interpret kwargs as <output_name>=(<selection>, <aggfunc>)
  2. Orthogonal to this issues, we'd like to implement MutliIndex.flatten and provide a flatten=True keyword to .agg
plankthom commented 5 years ago

Maybe this helps: my workaround for the deprecation are these helper functions that replaces the alias->aggr maps with list of correctly named functions:

def aliased_aggr(aggr, name):
    if isinstance(aggr,str):
        def f(data):
            return data.agg(aggr)
    else:
        def f(data):
            return aggr(data)
    f.__name__ = name
    return f

def convert_aggr_spec(aggr_spec):
    return {
        col : [ 
            aliased_aggr(aggr,alias) for alias, aggr in aggr_map.items() 
        ]  
        for col, aggr_map in aggr_spec.items() 
    }

which gives the old behaviour with:

mydf_agg = mydf.groupby('cat').agg(convert_aggr_spec{
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
}))

which is the same as

mydf_agg = mydf.groupby('cat').agg({
    'energy': [ 
        aliased_aggr('sum', 'total_energy'),
        aliased_aggr(lambda x: np.percentile(x, 98), 'energy_p98'),
        aliased_aggr(lambda x: np.percentile(x, 17), 'energy_p17')
    ],
    'distance': [
         aliased_aggr('sum', 'total_distance'),
         aliased_aggr('mean', 'average_distance'),
         aliased_aggr(smrb.mad, 'distance_mad'),
         aliased_aggr(mad_c1, 'distance_mad_c1'),
    ]
})

This works for me, but probably won't work in some corner cases ...

Update: found out that renaming is not necessary, as tuples in an aggregation specification are interpreted as (alias, aggr). So the alias_aggr function is not needed, and the conversion becomes:

def convert_aggr_spec(aggr_spec):
    return {
        col : [ 
           (alias,aggr) for alias, aggr in aggr_map.items() 
        ]  
        for col, aggr_map in aggr_spec.items() 
    }
kasuteru commented 5 years ago

I just want to chime in here as yet another user who is really, really missing the functionality of aggregating a column on any function and immediately renaming it in the same row. I have never found myself using the MultiIndex returned by pandas - I either flatten it immediately, or I actually want to manually specify my column names because they actually mean something specific.

I would be happy with any of the approaches proposed here: SQL-like syntax ( I actually find myself using .query() a lot in pandas already), reverting to the depreciated behavior, any of the other suggestions. The current approach already brought me ridicule from colleagues who use R.

I recently even found myself using PySpark instead of pandas even though it wasn't necessary, just because I like the syntax so much more:

df.groupby("whatever").agg(
    F.max("col1").alias("my_max_col"),
    F.avg("age_col").alias("average_age"),
    F.sum("col2").alias("total_yearly_payments")
)

Also PySpark is way more convoluted to write than pandas in most cases, this just looks so much cleaner! So I definitely appreciate that work on this is still be done :-)

TomAugspurger commented 5 years ago

I think we have an agreed syntax for this functionality; we need someone to implement it.

On Wed, Mar 27, 2019 at 9:01 AM Thomas Kastl notifications@github.com wrote:

I just want to chime in here as yet another user who is really, really missing the functionality of aggregating a column on any function and immediately renaming it in the same row. I have never found myself using the MultiIndex returned by pandas - I either flatten it immediately, or I actually want to manually specify my column names because they actually mean something specific.

I would be happy with any of the approaches proposed here: SQL-like syntax ( I actually find myself using .query() a lot in pandas already), reverting to the depreciated behavior, any of the other suggestions. The current approach already brought me ridicule from colleagues who use R.

I recently even found myself using PySpark instead of pandas even though it wasn't necessary, just because I like the syntax so much more:

df.groupby("whatever").agg( F.max("col1").alias("my_max_col"), F.avg("age_col").alias("average_age"), F.sum("col2").alias("total_yearly_payments") )

Also PySpark is way more convoluted to write than pandas in most cases, this just looks so much cleaner! So I definitely appreciate that work on this is still be done :-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/18366#issuecomment-477168767, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIkCYYsah5siYA4_z0oop_ufIB3h8ks5va3nJgaJpZM4QjSLL .

TomAugspurger commented 5 years ago

I'm going to try to get to this for 0.25.0

TomAugspurger commented 5 years ago

I've put up a PR at https://github.com/pandas-dev/pandas/pull/26399. The basic idea is to allow this mixture of rename & column-specific aggregation by using **kwargs with the understanding that the values should be tuples of (selection, aggfunc).

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0

This has a few limitations

And there's an implementation detail, multiple lambda aggfuncs for the same column isn't yet supported, though that can be fixed later.


I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?

cc @WillAyd if I missed any of your concerns.

zertrin commented 5 years ago

Hi @TomAugspurger,

Thanks for making this move forward.

This has a few limitations

  • It's somewhat peculiar to the rest of pandas. The sytanx (output_name=(selection, aggfunc)) doesn't really appear anywhere else (though .assign does use the output_name=... pattern)

I can't help but feel that this kind of argument seems quite similar to the one that motivated deprecating the existing implementation at the first place.

Could you share why we benefit more from this new way over the old one with respect to that particular argument?

One benefit that I could already think of is that (for py3.6+) we can select the output order of the columns individually.

  • The spelling for output names that aren't python identifiers is ugly: .agg(**{'output name': (col, func)})

Somehow, the old way was better in that respect. But as I said earlier, as long as it is possible to use the **{...} construct to build the aggregation dynamically, I would be happy enough.

  • It's Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of **kwargs wasn't previously preserved

How did it work before (existing dict-of-dict feature)? was the order guaranteed in some way?

  • The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you'll need to partially apply it first

Just to confirm my understanding: the aggfunc can be any callable that returns a valid value right? (in addition to the "often used" string aggfungs like 'min', 'max', etc. ). Is there any difference compared to before? (i.e. wasn't the unary limitation already present?)

And there's an implementation detail, multiple lambda aggfuncs for the same column isn't yet supported, though that can be fixed later.

Yeah that one is kind of annoying, but as long as it is just a temporary limitation and it is open to fix this, that could work.

I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?

Well, in any case I think that aggregate and rename in one step is really important to keep. If the old behavior is really not an option, then this alternative could do.

TomAugspurger commented 5 years ago

Could you share why we benefit more from this new way over the old one with respect to that particular argument.

I may be misremembering, but I believe SeriesGroupby.agg and DataFrameGroupby.agg have different meanings between the outer key in a dictionary (is it a column selection or an output naming?). With this syntax, we can consistently have the keyword mean the output name.

Somehow, the old way was better in that respect.

Is the difference just the **? Otherwise I think the same limitations are shared.

How did it work before (existing dict-of-dict feature)? was the order guaranteed in some way?

Sorting the keys, which is what I'm doing over in the PR now.

Just to confirm my understanding: the aggfunc can be any callable that returns a valid value right?

Here's the difference

In [21]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

In [22]: def aggfunc(x, myarg=None):
    ...:     print(myarg)
    ...:     return sum(x)
    ...:

In [23]: df.groupby("A").agg({'B': {'foo': aggfunc}}, myarg='bar')
/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py:1308: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super().aggregate(arg, *args, **kwargs)
None
Out[23]:
    B
  foo
A
a   3

with the alternative proposal, we're reserving **kwargs for output column names. So you would need to functools.partitial(aggfunc, myarg='bar').

zertrin commented 5 years ago

Ok thanks, I think the proposed approach is 👍 for a first iteration (and will really be ok as a replacement as soon as the multiple lambda implementation limitation is removed)