pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Feature Request for Ungroup Method for Grouped Data Frames #43902

Open mdancho84 opened 2 years ago

mdancho84 commented 2 years ago

Hi, thanks for your work developing pandas. I'd like to request an ungroup() method for grouped data frames. It's related to this StackOverflow question, where I developed a hack that uses the .obj attribute to pull the original data frame out of the grouped data frame.

However, it would be helpful to have a proper method that does the extraction, so that users don't have to depend on my hack.

>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
    order_date          category_2     value
1   2011-02-01  Cross Country Race  324400.0
2   2011-03-01  Cross Country Race  142000.0
3   2011-04-01  Cross Country Race  498580.0
4   2011-05-01  Cross Country Race  220310.0
5   2011-06-01  Cross Country Race  364420.0
..         ...                 ...       ...
535 2015-08-01          Triathalon   39200.0
536 2015-09-01          Triathalon   75600.0
537 2015-10-01          Triathalon   58600.0
538 2015-11-01          Triathalon   70050.0
539 2015-12-01          Triathalon   38600.0

[531 rows x 3 columns]
jreback commented 2 years ago

-1 as pd.concat([grp for g, grp in df.groupby...()]) is idiomatic. this is not worth a method.

hnagaty commented 2 years ago

That would be a nice feature and may come in handy at times.

mdancho84 commented 2 years ago

It would definitely be a handy tool that helps beginners seeking to extract groups. It also parallels R's tidyverse, which has ungroup() in dplyr, so it might make it easier for R users to transition to pandas.

s-pike commented 2 years ago

For me, this feature could be useful.

For reference, pd.concat([grp for g, grp in df.groupby...()]) doesn't seem to have quite the same output as df.groupby().obj. The former sorts the dataframe into groups, whereas the latter maintains the original row order. The .obj hack is also an order of magnitude faster in my tests (even if you follow the concat with df.sort_values(group), although again, the results aren't identical).
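
A minimal illustration of the ordering difference (made-up data):

import pandas as pd

df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'value': [1, 2, 3, 4]})

# Re-assembling the frame from its groups sorts the rows by group key...
print(pd.concat([grp for g, grp in df.groupby('group')]).index.tolist())  # [1, 3, 0, 2]

# ...whereas .obj hands back the frame in its original row order.
print(df.groupby('group').obj.index.tolist())  # [0, 1, 2, 3]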

It would really come into its own if the DataFrameGroupBy object had its own assign method, equivalent to dplyr's group_by %>% mutate functionality (see this stackoverflow question). If you like method chaining, the best current approach is using x.groupby('group').transform('fn')['value'], but that's potentially awkward if you want to use the group for multiple assignments, e.g.:

(df.assign(normalised_value = lambda x: x['value'] / x.groupby('group').transform('sum')['value'],
           normalising_value = lambda x: x.groupby('group').transform('sum')['value'])
  .more_methods...()
)
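
For what it's worth, you can avoid computing the transform twice by assigning the normalising value first and then reusing it, e.g. (a sketch with made-up data, not a real fix for the chaining awkwardness):

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})

(df.assign(normalising_value = lambda x: x.groupby('group')['value'].transform('sum'))
   .assign(normalised_value = lambda x: x['value'] / x['normalising_value'])
)

but that forces an extra helper column and intermediate step that a grouped assign would make unnecessary.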

It'd be nice to have something like:

(df.groupby('group')
  .assign(normalised_value = lambda x: x['value']/x['value'].sum(),
          normalising_value = lambda x: x['value'].sum())
  .ungroup()
  .more_methods...()
)

The R dplyr equivalent being:

df %>%
  group_by(group) %>%
  mutate(normalised_value = value / sum(value),
         normalising_value = sum(value)) %>%
  ungroup() %>%
  more_methods...()
pwwang commented 2 years ago

Looks like some of you are leaning toward R/dplyr styles.

Check out datar, which reimagines pandas APIs to align with R/dplyr's.

An example based on @s-pike's R code:

>>> from datar.all import f, tibble, group_by, mutate, ungroup, row_number, sum
[2022-03-17 11:25:33][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>> df = tibble(group=[1,1,2,2], value=[1,2,3,4])
>>> (
...     df
...     >> group_by(f.group)
...     >> mutate(normalised_value=f.value/sum(f.value), normalising_value=sum(f.value))
...     >> ungroup()
...     >> mutate(n=row_number())
... )
    group   value  normalised_value  normalising_value         n
  <int64> <int64>         <float64>            <int64> <float64>
0       1       1          0.333333                  3       1.0
1       1       2          0.666667                  3       2.0
2       2       3          0.428571                  7       3.0
3       2       4          0.571429                  7       4.0
M-Harrington commented 2 years ago

ungroup() as a simple wrapper seems like a no-brainer, especially for people new to Python who come from R. More generally, why would you write pd.concat([grp for g, grp in df.groupby...()]) when a method as simple as df.ungroup() could exist, even if ungroup() just calls that expression internally? It seems like a simple change that would clear up the multiple ways "ungrouping" can be done and reduce choice fatigue.
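
To make "simple wrapper" concrete, here is a hypothetical sketch of how thin the method could be. This is not actual pandas API; the monkey-patch is for illustration only, assuming the current behaviour where .obj holds the original frame:

import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy

def ungroup(self):
    """Return the original, ungrouped DataFrame."""
    return self.obj

DataFrameGroupBy.ungroup = ungroup  # hypothetical patch, not a real pandas method

df = pd.DataFrame({'A': ['one', 'two', 'one'], 'B': range(3)})
assert df.groupby('A').ungroup() is df  # hands back the very same frame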

jreback commented 2 years ago

if u think this is useful then show a complete example

the above is not very compelling

M-Harrington commented 2 years ago

I'm not sure I understand what you're looking for, @jreback, especially if you're referring to @s-pike's example. An example of why it might be useful to have a wrapper for ungrouping a dataframe? If you need to recover the original row order, as is common when matching against unlabeled numpy data for machine learning, a df that has been reordered by group makes matching the two datasets difficult.

This is a task that happens to me frequently.

jreback commented 2 years ago

a compelling example in code not words

M-Harrington commented 2 years ago

Can you answer my question by any chance? That'll make it easier for me to know what you're looking for.

jreback commented 2 years ago

yes, if u have something that could be a useful api, i need a compelling example. the one above is not

M-Harrington commented 2 years ago

import pandas as pd

# Use groupby to create intermediate results (e.g. for data science)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': range(6)})
df = df.groupby('A')
means = df.mean()

# Return to methods that are not defined for groupby objects
df = df.ungroup()  # the proposed method
print(df)
df['constant'] = 3
df.iloc[0, 2]

The point isn't that the above can't be done before the groupby; it's more a matter of workflow. Especially when interactively exploring the data, an ungroup option is super useful, not least because new users won't yet have learned methods such as transform that are defined for grouped objects.
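
And to put the earlier ordering point in code, a contrived sketch of the matching problem (the numpy array stands in for unlabeled model output aligned with the frame's original row order):

import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'value': [1.0, 2.0, 3.0, 4.0]})
preds = np.array([10.0, 20.0, 30.0, 40.0])  # aligned with df's original row order

# Rebuilding the frame from its groups reorders the rows, so positional
# alignment with preds is silently broken: the 'a' rows now receive preds[0:2].
rebuilt = pd.concat([grp for _, grp in df.groupby('group')])
rebuilt['pred'] = preds

# .obj (or a hypothetical .ungroup()) keeps the original order, so the
# positional match still holds.
aligned = df.groupby('group').obj.assign(pred=preds)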

M-Harrington commented 2 years ago

PS, because you are being kind of rude in how you're responding to me and the other users: more people in this thread think this would be useful than do not, so it would be great if you could explain why you think it isn't useful, beyond an appeal to tradition that there's a more "idiomatic" way of doing it.

jreback commented 2 years ago

@M-Harrington it's amazing how these comments just hurt open source maintainers - woa if i actually criticized something.

that said - your example still doesn't explain how ungroup actually adds anything to syntax, clarity or understanding of the code

i was expecting a lot more from someone who teaches

stephenjfox commented 2 years ago

@jreback I may have a decent code example from something I wrote just the other day: I wanted to combine multiple DataFrameGroupBy instances, which happen to be fields of a container object I have (called Dataset), in a sound way that wouldn't lose any information.

Here's a slimmed down version of the code:

from operator import attrgetter
from typing import List

import pandas as pd

def combine_multiple_datasets(backing_dses: List[Dataset]) -> Dataset:
    """A simple wiring together of multiple Datasets into one Dataset that is effectively the children, combined."""
    assert len(backing_dses), "Should have at least one backing dataset"
    ds_instance = Dataset.__new__(Dataset)

    # elided: copy fields (grouping_key, feature_columns, etc.) from children.

    # Combining GroupBy objects manually: flatten every (group_name, frame) pair
    all_data = [
        (group_name, df)
        for groupby in map(attrgetter('grouped'), backing_dses)
        for group_name, df in groupby
    ]

    ds_instance.grouped = pd.concat([df for _, df in all_data]).groupby(ds_instance.grouping_key)
    return ds_instance

Whereas an ungroup() could facilitate the following:

def combine_multiple_datasets_PREFERRED(backing_dses: List[Dataset]) -> Dataset:
    """A simple wiring together of multiple Datasets into one Dataset that is effectively the children, combined."""
    assert len(backing_dses), "Should have at least one backing dataset"
    ds_instance = Dataset.__new__(Dataset)

    # elided: copy fields (grouping_key, feature_columns, etc.) from children.

    # Combining GroupBy objects with the proposed DataFrameGroupBy.ungroup()
    ds_instance.grouped = pd.concat([ds.grouped.ungroup() for ds in backing_dses]).groupby(ds_instance.grouping_key)
    return ds_instance

Also, I don't have an R background. Just do OOP occasionally and want to leverage convenient lower-level abstractions in an elegant way.

M-Harrington commented 1 month ago

jreback, nobody is forcing you to resort to ad hominem. If that's what being part of the open source community means to you, by all means, please stop. No, seriously: just don't respond to this issue or this comment. Somebody else will pick it up, or not, and then so be it. When you treat the people who use your package poorly, you're not doing anyone a service, neither the package nor the people who are trying to use and learn about it.

As @stephenjfox said, we're just asking for something that "leverage[s] convenient lower-level abstractions in an elegant way". Other benefits include the chance to implement it more efficiently than allocating new memory for an object that already exists within the groupby as df.groupby(...).obj.
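
A small sketch of that memory point, assuming current pandas behaviour in which .obj is a reference to the original frame rather than a copy:

import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2], 'value': [1, 2, 3]})
gb = df.groupby('group')

# .obj hands back the very same frame object; nothing new is allocated.
assert gb.obj is df

# pd.concat, by contrast, builds a brand-new frame out of copies of the groups.
rebuilt = pd.concat([grp for _, grp in gb])
assert rebuilt is not df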