larray-project / larray

N-dimensional labelled arrays in Python
https://larray.readthedocs.io/
GNU General Public License v3.0

implement partial aggregates (LArray.regroup and Axis.regroup) #361

Open gdementen opened 7 years ago

gdementen commented 7 years ago

Implement an easier way to aggregate only part of an axis and leave other labels intact:

>>> a = ndtest(10)
>>> a.sum('a0;a1..a3 >> a13;a4;a5;a6;a7;a8;a9')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9

gdementen commented 6 years ago

Technically, that should not be too hard (*), but I am unsure about the syntax:

>>> a.partial_agg(sum, 'a1..a3 >> a13')
>>> a.partial_sum('a1..a3 >> a13')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9
>>> a.partial_mean('a1..a3 >> a13')
a   a0  a13   a4   a5   a6   a7   a8   a9
   0.0  2.0  4.0  5.0  6.0  7.0  8.0  9.0

(*) either create the group explicitly like above, or split the array using LArray.split()

gdementen commented 6 years ago

Note that we must also support arbitrary (non-contiguous) groups and (maybe) overlapping groups, which will make an implementation via .split()/.chunks() mostly impossible:

>>> a.partial_sum('a1..a3 >> a13;a6..a8 >> a68')
a  a0  a13  a4  a5  a68  a9
    0    6   4   5   21   9
>>> a.partial_sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
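
As a rough illustration of what the group strings above encode, here is a minimal plain-Python sketch which turns such a string into a mapping from new label to member labels. `parse_groups` is a hypothetical helper, not part of larray, and it makes a simplifying assumption: `..` ranges are resolved positionally against the axis labels.

```python
def parse_groups(spec, labels):
    """Parse 'a1..a3 >> a13;a6,a8 >> a68' into {'a13': [...], 'a68': [...]}.

    Hypothetical helper for illustration only: '..' ranges are resolved by
    position in `labels`, and items without '>>' are ignored (left untouched).
    """
    labels = list(labels)
    groups = {}
    for item in spec.split(';'):
        if '>>' not in item:
            continue  # a plain label: nothing to regroup
        members_spec, name = (part.strip() for part in item.split('>>'))
        members = []
        for token in members_spec.split(','):
            token = token.strip()
            if '..' in token:
                start, stop = (t.strip() for t in token.split('..'))
                first, last = labels.index(start), labels.index(stop)
                members.extend(labels[first:last + 1])  # inclusive range
            else:
                members.append(token)
        groups[name] = members
    return groups


labels = [f'a{i}' for i in range(10)]
print(parse_groups('a1..a3 >> a13;a6..a8 >> a68', labels))
# {'a13': ['a1', 'a2', 'a3'], 'a68': ['a6', 'a7', 'a8']}
print(parse_groups('a1,a3,a4 >> a134;a6,a8 >> a68', labels))
# {'a134': ['a1', 'a3', 'a4'], 'a68': ['a6', 'a8']}
```

Non-contiguous groups fall out naturally from this representation, which is not the case for an approach based on cutting the axis into consecutive chunks.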

gdementen commented 6 years ago

Now that I think of it, it might be better to implement this as a method on Axis, so that we do not have to define extra aggregate methods and it works out of the box for any aggregate. The difficulty in that case is to find a good name for the method:

>>> arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This seems technically interesting, but not very readable/obvious what it means.
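
To make the idea more concrete, here is a minimal plain-Python sketch of computing the complete grouping once and then feeding it to any aggregate function. The names `partial_grouping` and `aggregate` are hypothetical stand-ins (not larray API), the data is a plain list of labels and values, and groups are assumed not to overlap.

```python
def partial_grouping(labels, groups):
    """Return [(name, members), ...] covering every label of the axis.

    Labels which belong to no group form their own single-label group, and
    each named group appears once, where its first member is met on the axis.
    Assumes groups do not overlap.
    """
    member_to_name = {m: name for name, members in groups.items() for m in members}
    grouping = []
    emitted = set()
    for label in labels:
        name = member_to_name.get(label)
        if name is None:
            grouping.append((label, [label]))
        elif name not in emitted:
            emitted.add(name)
            grouping.append((name, list(groups[name])))
    return grouping


def aggregate(func, labels, values, grouping):
    """Apply any aggregate function (sum, max, ...) to each group."""
    by_label = dict(zip(labels, values))
    return {name: func([by_label[m] for m in members]) for name, members in grouping}


labels = [f'a{i}' for i in range(10)]
values = list(range(10))
grouping = partial_grouping(labels, {'a134': ['a1', 'a3', 'a4'], 'a68': ['a6', 'a8']})
print(aggregate(sum, labels, values, grouping))
# {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
print(aggregate(max, labels, values, grouping))
# {'a0': 0, 'a134': 4, 'a2': 2, 'a5': 5, 'a68': 8, 'a7': 7, 'a9': 9}
```

Because the grouping is independent of the aggregate, no extra `partial_*` method is needed per function.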

gdementen commented 6 years ago

I don't think partial is clear enough. Is partial_grouping understandable enough?

>>> arr = ndtest(10)
>>> a = arr.a
>>> a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

gdementen commented 6 years ago

Maybe Axis.regroup()?

gdementen commented 6 years ago

When groupby is done, we will be able to do this via set_labels + groupby. That would be an improvement compared to the current situation, but maybe not good enough as it is still quite verbose and inefficient.

>>> arr = ndtest(10)
>>> arr.set_labels('a', {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

gdementen commented 6 years ago

Other ideas:

gdementen commented 5 years ago

Given that #635 and the Grid class are slow in coming, we might want to implement Axis.regroup already, which would be very easy to do and would already help our users quite a bit.

gdementen commented 5 years ago

Here is some hacky code I wrote for BM. The goal was to offer an API as close as possible to the future LArray.regroup without depending on the groupby feature:

import numpy as np


class RegrouperMethod(object):
    def __init__(self, array, name, groups):
        self.array = array
        self.name = name
        if not isinstance(groups, tuple):
            groups = (groups,)
        groups = tuple(array._prepare_aggregate(name, groups))
        assert len(groups) == 1, "regroup only supports groups on one axis so far"
        if not isinstance(groups[0], tuple):
            groups = (groups,)
        new_groups = []
        for axis_groups in groups:
            axis = axis_groups[0].axis
            new_group = []
            # walk the axis in order so that the result keeps the original label order
            for label in axis:
                label_found = False
                for g in axis_groups:
                    first_elem = g[0] if isinstance(g.key, (tuple, list, np.ndarray, slice)) else g
                    if label in g:
                        label_found = True
                        # insert each group only once, at the position of its first label
                        if label == first_elem:
                            new_group.append(g)
                # labels which are not part of any group are kept as-is
                if not label_found:
                    new_group.append(label)
            new_groups.append(tuple(new_group))
        self.groups = tuple(new_groups)

    def __call__(self, *args, **kwargs):
        args = self.groups + args
        return getattr(self.array, self.name)(*args, **kwargs)

class Regrouper(object):
    def __init__(self, array, groups):
        self.array = array
        self.groups = groups

    def __getattr__(self, attr):
        return RegrouperMethod(self.array, attr, self.groups)

def regroup(array, groups):
    return Regrouper(array, groups)

Usage is like this:

>>> arr = ndtest((3, 4))
>>> arr
a\b  b0  b1  b2  b3
 a0   0   1   2   3
 a1   4   5   6   7
 a2   8   9  10  11
>>> regroup(arr, 'b1,b3 >> b13').sum()
a\b  b0  b13  b2
 a0   0    4   2
 a1   4   12   6
 a2   8   20  10
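
For readers who want to see the forwarding trick in isolation, here is a minimal sketch of the same proxy pattern without any larray dependency. `ToyArray` and `GroupBinder` are hypothetical names invented for this example. The point is only that `__getattr__` lets one object forward any aggregate method while prepending the stored groups.

```python
class ToyArray:
    """Hypothetical 1-D labelled array, only here to have something to call."""

    def __init__(self, labels, values):
        self.labels = list(labels)
        self.values = list(values)

    def _aggregate(self, func, groups):
        by_label = dict(zip(self.labels, self.values))
        labels = [name for name, _ in groups]
        values = [func([by_label[m] for m in members]) for _, members in groups]
        return ToyArray(labels, values)

    def sum(self, *groups):
        return self._aggregate(sum, groups)

    def max(self, *groups):
        return self._aggregate(max, groups)


class GroupBinder:
    """Proxy forwarding any method call to the array, with groups prepended."""

    def __init__(self, array, groups):
        self._array = array
        self._groups = tuple(groups)

    def __getattr__(self, name):
        # only called for attributes not found on the proxy itself
        method = getattr(self._array, name)

        def call(*args, **kwargs):
            return method(*self._groups, *args, **kwargs)
        return call


arr = ToyArray(['b0', 'b1', 'b2', 'b3'], [0, 1, 2, 3])
groups = [('b0', ['b0']), ('b13', ['b1', 'b3']), ('b2', ['b2'])]
res = GroupBinder(arr, groups).sum()
print(res.labels, res.values)
# ['b0', 'b13', 'b2'] [0, 4, 2]
```

The benefit of this design is that any aggregate supported by the array works without writing a dedicated wrapper for each one.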

alixdamman commented 5 years ago

If we implement the groupby feature one day, I wonder whether the existence of regroup will be confusing. A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand? Will regroup be interesting for other users?

gdementen commented 5 years ago

If we implement the groupby feature one day,

It is not an if, it is a when. It is just a matter of me being back on larray code after dc2019 is done.

I wonder if the existence of regroup will not be confusing.

It is always a tradeoff, but I think that in this case the benefits outweigh the costs.

A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand?

You know the answer to this question: it is obviously no.

Will regroup be interesting for other users?

Yes, it is a very common need, at least in our institution.

gdementen commented 1 year ago

I stumbled on the need with a slight variation: amg had to regroup "parts" of some combined axes. I wrote two different versions to solve her problem: a more limited but more efficient one, and a more general but less efficient one. The limited one handles only prefixes (i.e. the first part of the combined axis). The second one works for any "part" of the combined axis, but it splits the axis, does the aggregate, then recombines the axes.

from larray import isnan


def sum_prefixes(array, axis, prefixes, combined_prefix, sep='_'):
    axis = array.axes[axis]
    all_prefixes, suffixes = axis.split(sep=sep)
    starts_with_prefixes = axis.startingwith(prefixes[0])
    for prefix in prefixes[1:]:
        starts_with_prefixes = starts_with_prefixes.union(axis.startingwith(prefix))
    aggregated_groups = tuple(starts_with_prefixes.endingwith(s) >> f'{combined_prefix}{sep}{s}' for s in suffixes)
    other_groups = tuple(axis[:].difference(starts_with_prefixes))
    return array.sum(aggregated_groups + other_groups)

def split_axes_sum(array, combined_axis, group, sep='_'):
    orig_combined_axis = array.axes[combined_axis]
    split_axes = orig_combined_axis.split(sep=sep)
    split_array = array.split_axes(combined_axis, sep=sep)
    split_axis = split_array.axes[group.axis]
    nans = isnan(split_array)
    added_labels = nans[nans].axes[combined_axis]
    agg_array = split_array.sum((group,) + tuple(split_axis[:].difference(group)))
    combined_array = agg_array.combine_axes(split_axes)
    new_combined_axis = combined_array.axes[combined_axis]
    return combined_array.drop(added_labels.intersection(new_combined_axis))

>>> arr = ndtest('a_b=BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B')
>>> sum_prefixes(arr, 'a_b', ['BR', 'WA', 'FL'], 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.a['BR, WA, FL'] >> 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.b['B, C'] >> 'BC')
a_b  BR_BC  BR_A  WA_BC  FL_BC  FL_D  FR_BC  FR_A  DE_BC
       1.0   0.0    5.0    4.0   5.0    0.0   6.0    7.0
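
For comparison, the intended result of regrouping one "part" of a combined label can be sketched in plain Python, without larray. The helper `sum_label_parts` below is hypothetical and works on a list of labels and values rather than on an array. It only illustrates the expected totals, not how larray would compute them.

```python
def sum_label_parts(labels, values, part, members, name, sep='_'):
    """Sum values after renaming one part of each combined label.

    `part` is the position of the part within the label (0 for the prefix),
    `members` the part values to merge and `name` the merged part value.
    """
    totals = {}
    for label, value in zip(labels, values):
        parts = label.split(sep)
        if parts[part] in members:
            parts[part] = name
        new_label = sep.join(parts)
        totals[new_label] = totals.get(new_label, 0) + value
    return totals


labels = ['BR_A', 'BR_B', 'WA_B', 'WA_C', 'FL_C', 'FL_D', 'FR_A', 'DE_B']
values = list(range(8))
print(sum_label_parts(labels, values, 0, {'BR', 'WA', 'FL'}, 'BE'))
# {'BE_A': 0, 'BE_B': 3, 'BE_C': 7, 'BE_D': 5, 'FR_A': 6, 'DE_B': 7}
print(sum_label_parts(labels, values, 1, {'B', 'C'}, 'BC'))
# {'BR_A': 0, 'BR_BC': 1, 'WA_BC': 5, 'FL_BC': 4, 'FL_D': 5, 'FR_A': 6, 'DE_BC': 7}
```

In the second call, the label order differs from the split_axes_sum output above, and combinations which have no data (such as FR_BC) are not produced.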

This could one day be solved via some kind of pattern syntax, but it is hard to imagine something which is powerful enough and still readable:

>>> arr.sum('a_b[BR_{prod:*}, WA_{prod:*}, FL_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_*] >> BE_*')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
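
Until such a syntax exists, the closest existing tool is probably a regular expression with a named capture. The sketch below uses only Python's standard `re` module on plain lists. `sum_by_pattern` is a hypothetical helper, and the regex and replacement template are just one possible way of spelling the `{prod}` idea from the examples above.

```python
import re


def sum_by_pattern(labels, values, pattern, replacement):
    """Sum values after rewriting every label which fully matches `pattern`.

    `replacement` is a template in the syntax of re.Match.expand
    (e.g. r'BE_\g<prod>'); labels which do not match are kept unchanged.
    """
    regex = re.compile(pattern)
    totals = {}
    for label, value in zip(labels, values):
        match = regex.fullmatch(label)
        new_label = match.expand(replacement) if match else label
        totals[new_label] = totals.get(new_label, 0) + value
    return totals


labels = ['BR_A', 'BR_B', 'WA_B', 'WA_C', 'FL_C', 'FL_D', 'FR_A', 'DE_B']
values = list(range(8))
print(sum_by_pattern(labels, values, r'(?:BR|WA|FL)_(?P<prod>.*)', r'BE_\g<prod>'))
# {'BE_A': 0, 'BE_B': 3, 'BE_C': 7, 'BE_D': 5, 'FR_A': 6, 'DE_B': 7}
```

This is powerful enough for the case at hand, but the regex spelling is arguably less readable than the `{prod:*}` variants.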