pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Enumeration grouping #46147

Open harahu opened 2 years ago

harahu commented 2 years ago

Is your feature request related to a problem?

I wish I could use pandas to do more general split-apply-combine workloads. The split-apply-combine pattern is supported today using groupby and windowing operations. Whereas groupby is limited by the fact that every row in your DataFrame has to be mapped to 1-and-only-1 group, windowing operations are limited by having to have the same number of groups as rows in your DataFrame. This makes both approaches unsuited for cases where you want to compute summary statistics for a collection of arbitrary subsets of your data.

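Here are two examples of situations not covered today:

Grouping a population dataset by age groups with overlap. Say, ages [0-9, 5-14, 10-19, 15-24, ...].

Sliding windows with strides between windows > 1. Including non-constant strides.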

Describe the solution you'd like

I came across a general solution that I really like in a language/tool called esProc. I haven't tried it in practice, and I know very little about the tool's merits, but reading the documentation on what they call enumeration grouping had me excited.

The idea is similar to groupby, but instead of specifying a column to group on, you provide a sequence of conditions, where each condition results in a group that might or might not overlap other groups. This is pretty powerful, and results in a lot of flexibility.
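To make the idea concrete, here is a minimal sketch of how enumeration grouping can be emulated in pandas today with one boolean mask per group. The helper name `enum_group_agg`, the group labels, and the sample data are all hypothetical and only serve as illustration; this is not an existing pandas API, and it loops in Python rather than being vectorized.

```python
import pandas as pd


def enum_group_agg(df, conditions, agg_spec):
    """Hypothetical helper: aggregate each (possibly overlapping) mask separately."""
    # One aggregation per named boolean mask; a row may match several masks.
    rows = {name: df.loc[mask].agg(agg_spec) for name, mask in conditions.items()}
    # Transpose so that each group becomes one row of the result.
    return pd.DataFrame(rows).T


# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {"age": [3, 7, 12, 17, 22], "height": [100.0, 120.0, 150.0, 170.0, 180.0]}
)

# Overlapping groups expressed as boolean conditions.
conditions = {
    "0-9": df["age"] < 10,
    "5-14": (df["age"] >= 5) & (df["age"] < 15),
    "10-19": (df["age"] >= 10) & (df["age"] < 20),
    "15-24": (df["age"] >= 15) & (df["age"] < 25),
}

stats = enum_group_agg(df, conditions, {"height": "mean"})
print(stats)
```

Each mask is evaluated against the original frame, so no rows are duplicated, but the cost is one Python-level pass per group.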

Note that this solution doesn't really cover the sliding window "stride inflexibility" problem mentioned above, as you don't want to have to enumerate the masks that would give you your windows, but it does provide a more flexible alternative to groupby. I guess striding should be solved by enhancing the existing windowing APIs.

API breaking implications

N/A

Describe alternatives you've considered

Additional context

My original motivation for wanting this feature: https://stackoverflow.com/questions/71228107/vectorized-slice-aggregation-in-pandas

The esProc implementation (I think):

How it could look in pandas:


# Proposed API sketch: enum_group does not exist in pandas today.
age = df.age
stats = df.enum_group(
    [age < 10, (5 <= age) & (age < 15), (10 <= age) & (age < 20), (15 <= age) & (age < 25)]
).agg({"height": "mean"})

mroeschke commented 2 years ago

I think these scenarios are possible today without expanding the already large API

> Grouping a population dataset by age groups with overlap. Say, ages [0-9, 5-14, 10-19, 15-24, ...].

Can use cut or generate an IntervalIndex to define the groups ahead of time before calling groupby
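One possible reading of this suggestion, sketched under assumptions: as far as I can tell, `pd.cut` does not accept an overlapping IntervalIndex, so the overlapping age groups are split here into two staggered, non-overlapping binnings that are grouped separately and then concatenated. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {"age": [3, 7, 12, 17, 22], "height": [100.0, 120.0, 150.0, 170.0, 180.0]}
)

# Two non-overlapping binnings that together cover the overlapping groups.
bins_a = pd.IntervalIndex.from_breaks([0, 10, 20], closed="left")  # [0, 10), [10, 20)
bins_b = pd.IntervalIndex.from_breaks([5, 15, 25], closed="left")  # [5, 15), [15, 25)

# Group once per binning; ages outside a binning become NaN and are dropped.
parts = [
    df.groupby(pd.cut(df["age"], bins), observed=True)["height"].mean()
    for bins in (bins_a, bins_b)
]
means = pd.concat(parts)
print(means)
```

This only works when the overlapping groups can be decomposed into a small number of non-overlapping sets, which is not true for arbitrary subsets.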

> Sliding windows with strides between windows > 1. Including non-constant strides.

You can define custom window boundaries by defining a BaseIndexer subclass to generate your start/end points for each window

harahu commented 2 years ago

> I think these scenarios are possible today without expanding the already large API

Not saying they aren't possible. I'm claiming they aren't elegantly supported or solved. Python is Turing complete, after all.

> Can use cut or generate an IntervalIndex to define the groups ahead of time before calling groupby

Not sure I understand what you mean here. Isn't this equivalent to, and as memory inefficient as, the explode-based solution I mentioned? The simplest solution I can think of that doesn't blow up memory requirements is just using a Python loop to iterate through my various slices, manually extracting the slices and calling slice.agg, but that's slow, verbose, and something it feels like there should be a more elegant solution to.
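For readers following along, this is a guess at what an explode-based approach might look like, not necessarily the exact code referred to above. The group labels, bounds, and sample data are assumptions. Each row gets a list of the groups it belongs to, `DataFrame.explode` duplicates the row once per group, and a regular groupby does the rest.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {"age": [3, 7, 12, 17, 22], "height": [100.0, 120.0, 150.0, 170.0, 180.0]}
)

# Hypothetical overlapping groups as half-open [low, high) age ranges.
groups = {"0-9": (0, 10), "5-14": (5, 15), "10-19": (10, 20), "15-24": (15, 25)}

# For every row, list all groups whose range contains the row's age.
df["age_group"] = [
    [name for name, (low, high) in groups.items() if low <= age < high]
    for age in df["age"]
]

# explode() repeats each row once per group it belongs to.
exploded = df.explode("age_group")
means = exploded.groupby("age_group", sort=False)["height"].mean()
print(len(df), "rows before explode,", len(exploded), "rows after")
```

The row duplication is where the memory cost comes from: with heavily overlapping groups the exploded frame can be many times larger than the original.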

> You can define custom window boundaries by defining a BaseIndexer subclass to generate your start/end points for each window

Been there, tried that. It's quite neat and almost gets me what I want, except you have to provide bounds for exactly as many windows as there are rows in the df you're windowing into. Say I have some time series at second resolution for a year, and I want to aggregate statistics for 500 periods in that year. That means I still need to specify 31 million entirely meaningless windows I'm not going to use. The alternative is to re-sample and lose accuracy.
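A small sketch of that constraint, assuming pandas 1.5 or later (where `get_window_bounds` takes a `step` argument). The class name `ExplicitBoundsIndexer` and its `starts`/`ends` keyword arguments are hypothetical. Only two windows are wanted, but the indexer still has to return one pair of bounds per row, so the rest are padded with empty windows.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ExplicitBoundsIndexer(BaseIndexer):
    """Hypothetical indexer that returns caller-supplied window bounds."""

    def get_window_bounds(
        self, num_values=0, min_periods=None, center=None, closed=None, step=None
    ):
        # BaseIndexer stores constructor kwargs (starts, ends) as attributes.
        wanted = len(self.starts)
        # Pad with empty windows (start == end) so there is one window per row.
        start = np.full(num_values, num_values, dtype=np.int64)
        end = np.full(num_values, num_values, dtype=np.int64)
        start[:wanted] = self.starts
        end[:wanted] = self.ends
        return start, end


s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Two wanted windows: rows [0, 3) and rows [2, 6).
indexer = ExplicitBoundsIndexer(starts=[0, 2], ends=[3, 6])
sums = s.rolling(indexer, min_periods=1).sum()
print(sums)
```

The two useful results land on rows 0 and 1, which has no meaning for the data, and the remaining rows are NaN filler. With 500 wanted windows over a year of second-resolution data, nearly all of the output would be this filler.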

Not at all opposed to finding a solution using BaseIndexer, but that will require making it slightly more flexible.

rhshadrach commented 2 years ago

This is certainly an interesting proposal, something I don't think can be efficiently done today. If it can be implemented without a significant performance penalty to other groupby ops and only adds a reasonable amount of code complexity, it seems to me to be worthy of consideration. However, it appears to me to be an uncommon operation and so I don't think it should be pursued without these.

harahu commented 2 years ago

> [...] it seems to me to be worthy of consideration.

Appreciate hearing that. I hope I managed to paint a somewhat clear picture of what I am interested in. It's not always easy. I can provide more detail if need be.

> However, it appears to me to be an uncommon operation and so I don't think it should be pursued without these.

It might be, or it could be a common need that people get by without being able to cover. Not sure.

My main use case, that I haven't discussed in detail, is doing hierarchical segmentation of time series, in combination with mathematical morphology operations, like dilation, creating hierarchically nested groups, with some cross-hierarchy overlap, where I want statistical summaries at all hierarchy levels. So, essentially, old school image processing, but for (multi-channel) time series.

rhshadrach commented 2 years ago

@mroeschke - I'm pretty convinced this can't be readily done with the API today, so taking off the Needs Info tag, but let me know if you think this is incorrect.

erezinman commented 1 year ago

I've been trying to do the same. After hours of wandering through the "groupby" code (groupby, GroupBy, Grouper, Groupings, etc.), I have to say that it is extremely (and possibly needlessly) convoluted and, from my viewpoint, non-extensible. The irony is that the whole BaseIndexer mechanism used for rolling windows is similar in nature, and very extensible. I'm sure this isn't a priority, but a change (and possibly a redo) in that area of the code, as well as an extension mechanism, would be a great addition to the project.