pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.82k stars 17.99k forks

ENH: Group names in a groupby should be `namedtuple` instead of `tuple` #40057

Open rben01 opened 3 years ago

rben01 commented 3 years ago

Is your feature request related to a problem?

The columns in a DataFrame have names. When you iterate over a groupby object and get the names of the groups (which correspond to values in the original DataFrame), the group names are returned as ordinary tuples and so their correspondence to the columns of the DataFrame is lost. Fields in the name tuple can only be accessed by their index; the more natural option of using the corresponding column name in the original DataFrame is not possible.

Describe the solution you'd like

When iterating over a groupby -- for name, group in df.groupby(...): -- name should be a namedtuple instead of an ordinary tuple, so that fields can be accessed by column name. Any other function that returns the name of a group should likewise return a namedtuple rather than a tuple. The usual namedtuple rules for field names that are not valid Python identifiers should apply (with rename=True they are replaced by positional names such as _0, _1; otherwise namedtuple raises ValueError), especially when the column labels are not strings.
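For reference, here is a small sketch of how the standard library handles field names that are not valid identifiers. The labels and the GroupKey type name are made up for illustration; only collections.namedtuple itself is real.

```python
from collections import namedtuple

# Hypothetical column labels: one valid, one with a space, one keyword,
# one that starts with a digit, and one duplicate.
labels = ["col1", "my col", "class", "2020", "col1"]

# By default, namedtuple rejects invalid field names with ValueError.
try:
    namedtuple("GroupKey", labels)
except ValueError as exc:
    print("default behavior raises:", exc)

# With rename=True, each offending name is replaced by "_" plus its position.
GroupKey = namedtuple("GroupKey", labels, rename=True)
print(GroupKey._fields)  # ('col1', '_1', '_2', '_3', '_4')

# Non-string labels are converted with str() first, then renamed if invalid.
IntKey = namedtuple("IntKey", [0, 1], rename=True)
print(IntKey._fields)  # ('_0', '_1')
```

Note that the replacement names are positional, so the first field would become _0, and a renamed field can no longer be looked up by its original column label.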

API breaking implications

As far as I know, namedtuples expose a strict superset of the API of tuples, so there shouldn't be any breaking changes.

Describe alternatives you've considered

This very ugly solution:

grouping_cols = ["col1", "col2"]
g = df.groupby(grouping_cols)
name, _ = next(iter(g))
name_col1 = name[grouping_cols.index("col1")]
# now name_col1 is the value in the `name` tuple corresponding to the `"col1"` column

Additional context

Ideally I could simply write name.col1.
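Until something like this exists in pandas, a user-side wrapper can approximate it. The following is only a sketch: iter_named_groups and GroupKey are hypothetical names, not pandas API, and the helper assumes the grouping is given as a list of column labels.

```python
from collections import namedtuple

import pandas as pd


def iter_named_groups(df, by):
    """Yield (key, group) pairs where key is a namedtuple.

    Hypothetical helper, not part of pandas. `by` must be a list of
    column labels.
    """
    # rename=True replaces labels that are not valid identifiers with _0, _1, ...
    GroupKey = namedtuple("GroupKey", [str(label) for label in by], rename=True)
    for key, group in df.groupby(by):
        # Some pandas versions yield a scalar key when grouping by one column.
        if not isinstance(key, tuple):
            key = (key,)
        yield GroupKey(*key), group


df = pd.DataFrame(
    {"col1": ["a", "a", "b"], "col2": [1, 2, 1], "value": [10, 20, 30]}
)
for name, group in iter_named_groups(df, ["col1", "col2"]):
    print(name.col1, name.col2, len(group))
```

Because the key is still a tuple subclass, existing code that unpacks or indexes it by position keeps working.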

TomAugspurger commented 3 years ago

> As far as I know, namedtuples expose a strict superset of the API of tuples, so there shouldn't be any breaking changes.

In the past we ran into issues with namedtuples not being pickleable and being limited to (I think) 255 fields. The field limit may be less relevant for groupby than for itertuples, but I guess it's possible someone is grouping by more than 255 levels.

rben01 commented 3 years ago

I can create a namedtuple with 10,000 fields without an issue (although it does take some time), but the pickleability does seem like a problem. Maybe a groupby argument names_as_namedtuple: bool that controls this behavior to maintain backwards compatibility? I never attempt to pickle groupby names, so I'd definitely pass names_as_namedtuple=True to all my own groupbys.
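One way the pickling problem can show up is sketched below. It assumes the namedtuple class is created on the fly inside a function, so pickle cannot find the class by name when it tries to serialize an instance. make_key and GroupKey are illustrative names only, and this is not necessarily the exact failure pandas hit.

```python
import pickle
from collections import namedtuple


def make_key(labels, values):
    # The class is created dynamically and never bound to a module-level name.
    GroupKey = namedtuple("GroupKey", labels)
    return GroupKey(*values)


key = make_key(["col1", "col2"], ("a", 1))

# pickle stores a reference to the class by module and name, and that lookup fails.
try:
    pickle.dumps(key)
    picklable = True
except (pickle.PicklingError, AttributeError) as exc:
    picklable = False
    print("cannot pickle:", exc)

# Converting to a built-in type first sidesteps the problem.
restored_tuple = pickle.loads(pickle.dumps(tuple(key)))
restored_dict = pickle.loads(pickle.dumps(key._asdict()))
print(restored_tuple, restored_dict)
```

A namedtuple class defined at module level under its own name does pickle, so the issue is specific to classes generated per call, which is what a per-groupby namedtuple would likely be.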

rben01 commented 3 years ago

I also wonder about the feasibility of using a plain old dict for the name. They’d be able to have any valid column label as a key, and since they’re ordered, they’d have tuple(name_dict.values()) == name_tuple. And since dicts are built in, they're pickleable.

(In simple cases you could create a dict yourself by zipping the groupby columns with the group name, but this won’t work in general if the groupby keys are something other than column labels, e.g., a pd.Grouper object.)
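The zipping approach for the simple case can be sketched as follows. The name tuple here is a stand-in for one group key as a groupby over three columns would yield it, and the labels are invented to show that keys need not be valid identifiers or even strings.

```python
import pickle

# Assumed grouping columns: a plain name, a label with a space, and an integer.
grouping_cols = ["col1", "my col", 2020]

# Stand-in for the `name` tuple yielded when iterating over the groupby.
name = ("a", 1, True)

# Pair each column label with the matching field of the group name.
name_dict = dict(zip(grouping_cols, name))
print(name_dict["my col"], name_dict[2020])

# Dicts preserve insertion order, so the original tuple is recoverable.
assert tuple(name_dict.values()) == name

# Built-in dicts round-trip through pickle without extra work.
restored = pickle.loads(pickle.dumps(name_dict))
print(restored == name_dict)
```

One caveat with this sketch: if two grouping keys share the same label, the later value overwrites the earlier one in the dict, so the round trip back to the tuple no longer holds.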

nkvaltine commented 2 years ago

Zipping the name and columns together works for my case, so thanks for the tip, but I also think it should be a named tuple.