harahu opened this issue 2 years ago
I think these scenarios are possible today without expanding the already large API.
> Grouping a population dataset by age groups with overlap. Say, ages [0-9, 5-14, 10-19, 15-24, ...].

Can use `cut` or generate an `IntervalIndex` to define the groups ahead of time before calling `groupby`.
> Sliding windows with strides between windows > 1. Including non-constant strides.

You can define custom window boundaries by defining a `BaseIndexer` subclass to generate your start/end points for each window.
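A minimal sketch of that suggestion, with an invented class name and toy data (pandas also ships a built-in `FixedForwardWindowIndexer` for this particular shape):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


# Sketch of a BaseIndexer subclass: one forward-looking window of
# `window_size` rows per input row. Note it must return one (start, end)
# pair for *every* row -- the constraint discussed later in the thread.
class ForwardWindowIndexer(BaseIndexer):
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        start = np.arange(num_values, dtype=np.int64)
        end = np.minimum(start + self.window_size, num_values)
        return start, end


s = pd.Series(range(10), dtype=float)
sums = s.rolling(ForwardWindowIndexer(window_size=3), min_periods=1).sum()
```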
> I think these scenarios are possible today without expanding the already large API.
Not saying they aren't possible. I'm claiming they aren't elegantly supported or solved. Python is Turing complete, after all.
> Can use `cut` or generate an `IntervalIndex` to define the groups ahead of time before calling `groupby`.
Not sure I understand what you mean here. Isn't this equivalent to, and as memory-inefficient as, the `explode`-based solution I mentioned? The simplest solution I can think of that doesn't blow up memory requirements is just using a Python loop to iterate through my various slices, manually extracting each slice and calling `slice.agg`, but that's slow, verbose, and something it feels like there should be a more elegant solution to.
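The loop-based workaround described here might look like the following (the data and band bounds are invented for illustration):

```python
import pandas as pd

# Toy data: overlapping age bands as in the example from the issue.
df = pd.DataFrame({"age": [3, 7, 12, 18, 22],
                   "income": [10, 20, 30, 40, 50]})
bands = {"0-9": (0, 9), "5-14": (5, 14),
         "10-19": (10, 19), "15-24": (15, 24)}

# The slow-but-simple route: one .agg call per slice, in a Python loop.
stats = pd.DataFrame({
    name: df.loc[df["age"].between(lo, hi), "income"].agg(["sum", "count"])
    for name, (lo, hi) in bands.items()
}).T
```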
> You can define custom window boundaries by defining a `BaseIndexer` subclass to generate your start/end points for each window.
Been there, tried that. It's quite neat and almost gets me what I want, except you have to provide bounds for exactly as many windows as there are rows in the df you're windowing into. Say I have some time series at second resolution for a year, and I want to aggregate statistics for 500 periods in that year. That means I still need to specify 31 million entirely meaningless windows I'm not going to use. The alternative is to re-sample and lose accuracy.

Not at all opposed to finding a solution using `BaseIndexer`, but that will require making it slightly more flexible.
This is certainly an interesting proposal, something I don't think can be efficiently done today. If it can be implemented without a significant performance penalty to other groupby ops and only adds a reasonable amount of code complexity, it seems to me to be worthy of consideration. However, it appears to me to be an uncommon operation and so I don't think it should be pursued without these.
> [...] it seems to me to be worthy of consideration.
Appreciate hearing that. I hope I managed to paint a somewhat clear picture of what I am interested in. It's not always easy. I can provide more detail if need be.
> However, it appears to me to be an uncommon operation and so I don't think it should be pursued without these.
It might be, or it could be a common need that people currently just work around. Not sure.
My main use case, which I haven't discussed in detail, is hierarchical segmentation of time series, in combination with mathematical morphology operations like dilation. This creates hierarchically nested groups, with some cross-hierarchy overlap, where I want statistical summaries at all hierarchy levels. So, essentially, old-school image processing, but for (multi-channel) time series.
@mroeschke - I'm pretty convinced this can't be readily done with the API today, so taking off the Needs Info tag, but let me know if you think this is incorrect.
I've been trying to do the same. After hours of wandering through the "groupby" code (`groupby`, `GroupBy`, `Grouper`, `Groupings`, etc.), I have to say that it is extremely (and possibly needlessly) convoluted and, from my viewpoint, non-extensible. The irony is that the whole `BaseIndexer` mechanism used for rolling windows is similar in nature, and very extensible. I'm sure this isn't a priority, but a change (and possibly a redo) in that area of the code, along with an extension mechanism, would be a great addition to the project.
### Is your feature request related to a problem?
I wish I could use pandas to do more general split-apply-combine workloads. The split-apply-combine pattern is supported today using `groupby` and windowing operations. Whereas `groupby` is limited by the fact that every row in your DataFrame has to be mapped to 1-and-only-1 group, windowing operations are limited by having to have the same number of groups as rows in your DataFrame. This makes both approaches unsuited for cases where you want to compute summary statistics for a collection of arbitrary subsets of your data.

Here are two examples of situations not covered today:

- Grouping a population dataset by age groups with overlap. Say, ages [0-9, 5-14, 10-19, 15-24, ...].
- Sliding windows with strides between windows > 1. Including non-constant strides.
### Describe the solution you'd like
I came across a general solution that I really like in a language/tool called esProc. I haven't tried it in practice, and I know very little about the tool's merits, but reading the documentation on what they call enumeration grouping had me excited.

The idea is similar to `groupby`, but instead of specifying a column to group on, you provide a sequence of conditions, where each condition results in a group that might or might not overlap other groups. This is pretty powerful, and results in a lot of flexibility.

Note that this solution doesn't really cover the sliding-window "stride inflexibility" problem mentioned above, as you don't want to have to enumerate the masks that would give you your windows, but it does provide a more flexible alternative to `groupby`. I guess striding should be solved by enhancing the existing windowing APIs.

### API breaking implications
N/A
### Describe alternatives you've considered
- Add a column to the df, where each cell contains a list enumerating the group memberships of its corresponding row. Then use `df.explode` to create a df where `groupby` can be used as is. This is not satisfactory, as the `df.explode` call, as the method name suggests, risks being highly memory-inefficient if there is a lot of group overlap.
- Concatenating the results of multiple group-bys. This is significantly more verbose and might result in arbitrarily many group-by cols.
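The first alternative might be sketched like this (toy data invented for illustration); note how overlapping memberships duplicate rows after the explode:

```python
import pandas as pd

# Toy data: rows that belong to several (overlapping) age bands get a
# list-valued membership column.
df = pd.DataFrame({"age": [3, 7, 12], "income": [1, 2, 3]})
bands = {"0-9": (0, 9), "5-14": (5, 14), "10-19": (10, 19)}

df["group"] = df["age"].apply(
    lambda a: [name for name, (lo, hi) in bands.items() if lo <= a <= hi])

# explode duplicates each row once per membership -- the memory cost
# mentioned above -- after which a plain groupby works.
totals = df.explode("group").groupby("group")["income"].sum()
```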
### Additional context
My original motivation for wanting this feature: https://stackoverflow.com/questions/71228107/vectorized-slice-aggregation-in-pandas
The esProc implementation (I think):

How it could look in pandas:
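The original code snippets did not survive extraction. Purely as an illustration of the enumeration-grouping idea, a minimal sketch with a helper of my own invention (this is not pandas API, nor necessarily the author's proposed spelling): each boolean mask defines one group, and groups may overlap.

```python
import pandas as pd


def enumerated_groupby(df, conditions):
    """Sketch of enumeration grouping (hypothetical helper, not pandas
    API): each boolean mask names one group; groups may overlap.
    Returns {group name -> sub-DataFrame}."""
    return {name: df[mask] for name, mask in conditions.items()}


df = pd.DataFrame({"age": [3, 7, 12, 18], "income": [1, 2, 3, 4]})
groups = enumerated_groupby(df, {
    "0-9": df["age"].between(0, 9),    # overlaps the next group
    "5-14": df["age"].between(5, 14),
})
means = {name: g["income"].mean() for name, g in groups.items()}
```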