Open rben01 opened 3 years ago
In the past we ran into issues with namedtuples not being pickleable and being limited to (I think) 255 fields (which may be less relevant for groupby than for `itertuples`, but I guess it's possible someone is grouping by >255 levels).
I can create a `namedtuple` with 10,000 fields without an issue (although it does take some time), but the pickleability does seem like a problem. Maybe a `groupby` argument `names_as_namedtuple: bool` that controls this behavior to maintain backwards compatibility? I never attempt to pickle groupby names, so I'd definitely pass `names_as_namedtuple=True` to all my own groupbys.
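For context, here is a minimal standard-library sketch of the pickling concern. It assumes the namedtuple class would be created on the fly (here inside a hypothetical `make_key` function) rather than bound to a module-level name, which is presumably what a groupby would have to do. `make_key` and `can_pickle` are illustrative helpers, not pandas APIs.

```python
import pickle
from collections import namedtuple


def make_key(fields, values):
    """Build a namedtuple instance from a class created on the fly (hypothetical helper)."""
    # The class is only bound to a local variable, so pickle cannot find it
    # again by looking up "GroupKey" as an attribute of the defining module.
    GroupKey = namedtuple("GroupKey", fields)
    return GroupKey(*values)


def can_pickle(obj):
    """Return True if obj survives a pickle round trip, False if pickling fails."""
    try:
        return pickle.loads(pickle.dumps(obj)) == obj
    except (pickle.PicklingError, AttributeError):
        return False


plain = ("a", 1)
key = make_key(["col1", "col2"], plain)

# More than 255 fields is accepted on Python 3.7 and later.
wide = make_key([f"f{i}" for i in range(300)], range(300))

print(can_pickle(plain))  # an ordinary tuple
print(can_pickle(key))    # a dynamically created namedtuple
print(len(wide))
```

The field-count limit is not a problem on current Python versions, but the dynamically created class still cannot be located by `pickle`, which is the backwards-compatibility risk discussed above.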
I also wonder about the feasibility of using a plain old `dict` for the `name`. They’d be able to have any valid column label as a key, and since they’re ordered, they’d have `tuple(name_dict.values()) == name_tuple`. And since dicts are built in, they're pickleable.
(In simple cases you could create a dict yourself by zipping the group name with the groupby columns, but this won’t work in general if the groupby columns are something other than the column labels, e.g., a `pd.Grouper` object.)
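A minimal sketch of the zipping approach, in plain Python so that it runs without pandas. The column labels and the group name below are made-up stand-ins for what iterating over a groupby on two columns would yield.

```python
import pickle

# Hypothetical groupby columns and one group name, shaped like the plain
# tuple that iterating over a two-column groupby would produce.
groupby_columns = ["col1", "my col"]
name_tuple = ("a", 1)

# Zip column labels with the group's values. Dicts preserve insertion
# order (guaranteed since Python 3.7), so the field order is kept.
name_dict = dict(zip(groupby_columns, name_tuple))

# Fields are addressable by column label, including labels that are not
# valid Python identifiers.
print(name_dict["col1"], name_dict["my col"])

# The values, in order, reproduce the original tuple.
assert tuple(name_dict.values()) == name_tuple

# A dict of picklable values round-trips through pickle.
assert pickle.loads(pickle.dumps(name_dict)) == name_dict
```

One trade-off of this design: a dict is not hashable and does not compare equal to the original tuple, so code that uses the group name as a dictionary key would need to convert it back with `tuple(name_dict.values())`.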
Zipping the name and columns together works for my case, so thanks for the tip, but I also think it should be a named tuple.
Is your feature request related to a problem?
The columns in a DataFrame have names. When you iterate over a groupby object and get the names of the groups (which correspond to values in the original DataFrame), the group names are returned as ordinary `tuple`s and so their correspondence to the columns of the DataFrame is lost. Fields in the `name` tuple can only be accessed by their index; the more natural option of using the corresponding column name in the original DataFrame is not possible.
Describe the solution you'd like
When iterating over a groupby -- `for name, group in df.groupby(...):` -- `name` should be a `namedtuple` instead of an ordinary `tuple` so as to allow accessing fields by column name. Any other function that returns the name of a group should return it as a `namedtuple` and not a tuple. The ordinary considerations about what happens when an invalid Python identifier is used as a `namedtuple` field should apply (I think they get replaced with `_1`, `_2`, etc, but not sure about this), especially when the columns have names that are not strings.
API breaking implications
As far as I know, `namedtuple`s expose a strict superset of the API of `tuple`s, so there shouldn't be any breaking changes.
Describe alternatives you've considered
This very ugly solution:
Additional context
Ideally I could simply write `name.col1`.
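As a rough sketch of the requested behavior, the wrapper below converts each group name into a namedtuple. `named_groups` is a hypothetical helper, not a pandas API; it accepts any iterable of `(name, group)` pairs, so the example uses a plain list as a stand-in for a groupby object. It relies on `collections.namedtuple(..., rename=True)`, which replaces invalid field names with an underscore followed by the field's position.

```python
from collections import namedtuple


def named_groups(groups, columns, typename="GroupName"):
    """Yield (name, group) pairs with each name converted to a namedtuple.

    `groups` is any iterable of (name, group) pairs, such as a groupby
    object; `columns` are the labels that were grouped by.
    Hypothetical helper for illustration only.
    """
    # rename=True replaces invalid or duplicate field names with _<index>.
    # namedtuple converts every label with str() before validating it, so
    # non-string labels are accepted and then renamed if needed.
    Name = namedtuple(typename, columns, rename=True)
    for name, group in groups:
        # Grouping by a single column may produce a scalar name; wrap it.
        if not isinstance(name, tuple):
            name = (name,)
        yield Name(*name), group


# Stand-in for iterating over a groupby on ["col1", "my col"].
fake_groups = [(("a", 1), [10, 20]), (("b", 2), [30])]

results = list(named_groups(fake_groups, ["col1", "my col"]))
for name, group in results:
    # "my col" is not a valid identifier, so it is exposed as _1.
    print(name.col1, name._1, group)
```

With pandas, the call would presumably look like `named_groups(df.groupby(cols), cols)` where `cols` is a list of column labels; this does not cover groupers such as `pd.Grouper`. Because a namedtuple compares equal to the plain tuple with the same values, existing code that indexes or compares group names keeps working.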