gdementen opened 7 years ago
Technically, that should not be too hard (*), but I am unsure about the syntax:
>>> a.partial_agg(sum, 'a1..a3 >> a13')
>>> a.partial_sum('a1..a3 >> a13')
a a0 a13 a4 a5 a6 a7 a8 a9
0 6 4 5 6 7 8 9
>>> a.partial_mean('a1..a3 >> a13')
a a0 a13 a4 a5 a6 a7 a8 a9
0.0 2.0 4.0 5.0 6.0 7.0 8.0 9.0
(*) either create the group explicitly like above, or split the array using LArray.split()
Note that we must also support arbitrary (non-contiguous) groups and (maybe) overlapping groups, which will make an implementation via .split()/.chunks() mostly impossible:
>>> a.partial_sum('a1..a3 >> a13;a6..a8 >> a68')
a a0 a13 a4 a5 a68 a9
0 6 4 5 21 9
>>> a.partial_sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
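The semantics above can be sketched in plain Python (this is not larray code; `partial_agg` and its signature are hypothetical): labels listed in a group are reduced into one new label placed at the position of the group's first member, all other labels pass through unchanged, and because the groups drive a generic reduction function, any aggregate works. This sketch only handles explicit comma-separated labels, not the `..` range syntax.

```python
def partial_agg(values, spec, func=sum):
    """values: ordered dict mapping label -> number;
    spec: 'l1,l2 >> new1;l3,l4 >> new2' (explicit labels, no '..' ranges)."""
    mapping = {}  # member label -> (new label, full member list)
    for part in spec.split(';'):
        members_str, new_label = [s.strip() for s in part.split('>>')]
        members = [m.strip() for m in members_str.split(',')]
        for m in members:
            mapping[m] = (new_label, members)
    out = {}
    for label, value in values.items():
        # ungrouped labels behave like singleton groups
        new_label, members = mapping.get(label, (label, [label]))
        if new_label not in out:
            out[new_label] = func(values[m] for m in members)
    return out

values = {f'a{i}': i for i in range(10)}
partial_agg(values, 'a1,a3,a4 >> a134;a6,a8 >> a68')
# -> {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
```

Passing `func=max` (or any other reduction) gives the corresponding partial aggregate without defining a dedicated method per aggregate.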
Now that I think of it, it might be better to implement this as a method on Axis, so that we do not have to define extra aggregate methods and it works out of the box for any aggregate. The difficulty in that case is to find a good name for the method:
>>> arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
a['a1', 'a3', 'a4'] >> 'a134',
a['a2'],
a['a5'],
a['a6', 'a8'] >> 'a68',
a['a7'],
a['a9'])
>>> arr.sum(arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68'))
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This seems technically interesting, but not very readable/obvious what it means.
I don't think `partial` is clear enough. Is `partial_grouping` understandable enough?
>>> arr = ndtest(10)
>>> a = arr.a
>>> a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
a['a1', 'a3', 'a4'] >> 'a134',
a['a2'],
a['a5'],
a['a6', 'a8'] >> 'a68',
a['a7'],
a['a9'])
>>> arr.sum(a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68'))
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
maybe Axis.regroup() ?
When groupby is done, we will be able to do this via set_labels + groupby. That would be an improvement compared to the current situation, but maybe not good enough as it is still quite verbose and inefficient.
>>> arr = ndtest(10)
>>> arr.set_labels('a', {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
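The verbosity of that route is easy to see in a plain-Python sketch of the same relabel-then-reduce idea (no larray here): every single group member must be spelled out in the relabel mapping before the reduction.

```python
# relabel each member explicitly, then sum values sharing a label
relabel = {'a1': 'a134', 'a3': 'a134', 'a4': 'a134',
           'a6': 'a68', 'a8': 'a68'}
values = {f'a{i}': i for i in range(10)}

sums = {}
for label, value in values.items():
    new = relabel.get(label, label)
    sums[new] = sums.get(new, 0) + value

print(sums)
# {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
```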
Other ideas:
>>> arr = ndtest(10)
>>> # I like this, because it simply generalizes what we already have. We might want to implement this regardless of this "partial grouping" feature
>>> arr.set_labels('a', {X.a['a1,a3,a4']: 'a134', X.a['a6,a8']: 'a68'}).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> # or even reuse an existing group label (this might be going too far?)
>>> arr.set_labels('a', (X.a['a1,a3,a4'] >> 'a134', X.a['a6,a8'] >> 'a68')).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> # ... but this would be practical
>>> arr.set_labels('a1,a3,a4 >> a134;a6,a8 >> a68').groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> arr.sum('a1,a3,a4 >> a134;a6,a8 >> a68', partial_agg=True) # or "partial" or "keep_other" or ...
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This gets awkward when we want to combine partial and non partial aggregates.
>>> arr.partial.sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> arr.regroup('a1,a3,a4 >> a134;a6,a8 >> a68').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> groups = arr.a.regroup('a1,a3,a4 >> a134;a6,a8 >> a68')
>>> arr.sum(groups)
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This is my currently preferred option, but (the way I see it) it would benefit from a Grid class; LArray.regroup() would return such an object. This Grid thing is more or less implemented in my local branch to implement #635.
Given that #635 and the Grid class are slow in coming, we might want to implement Axis.regroup already, which would be very easy to do and would already help our users quite a bit.
Here is some hacky code I did for BM. The goal was to offer an API as close as possible to the future LArray.regroup without depending on the groupby feature:
import numpy as np

class RegrouperMethod(object):
    def __init__(self, array, name, groups):
        self.array = array
        self.name = name
        if not isinstance(groups, tuple):
            groups = (groups,)
        groups = tuple(array._prepare_aggregate(name, groups))
        assert len(groups) == 1, "regroup only supports groups on one axis so far"
        if not isinstance(groups[0], tuple):
            groups = (groups,)
        new_groups = []
        for axis_groups in groups:
            axis = axis_groups[0].axis
            new_group = []
            for l in axis:
                lfound = False
                for g in axis_groups:
                    first_elem = g[0] if isinstance(g.key, (tuple, list, np.ndarray, slice)) else g
                    if l in g:
                        lfound = True
                        if l == first_elem:
                            new_group.append(g)
                if not lfound:
                    new_group.append(l)
            new_groups.append(tuple(new_group))
        self.groups = tuple(new_groups)

    def __call__(self, *args, **kwargs):
        args = self.groups + args
        return getattr(self.array, self.name)(*args, **kwargs)


class Regrouper(object):
    def __init__(self, array, groups):
        self.array = array
        self.groups = groups

    def __getattr__(self, attr):
        return RegrouperMethod(self.array, attr, self.groups)


def regroup(array, groups):
    return Regrouper(array, groups)
Usage is like this:
>>> arr = ndtest((3, 4))
>>> arr
a\b b0 b1 b2 b3
a0 0 1 2 3
a1 4 5 6 7
a2 8 9 10 11
>>> regroup(arr, 'b1,b3 >> b13').sum()
a\b b0 b13 b2
a0 0 4 2
a1 4 12 6
a2 8 20 10
If we implement the groupby feature one day, I wonder if the existence of regroup will not be confusing.
A more general question is: do we need to take the risk of making LArray's API incomprehensible by including every specific demand?
Will regroup be interesting to other users?
> If we implement the groupby feature one day

It is not an if, it is a when. It is just a matter of me being back on larray code after dc2019 is done.

> I wonder if the existence of regroup will not be confusing.

It is always a tradeoff, but I think that in this case the benefits outweigh the costs.

> A more general question is: do we need to take the risk of making LArray's API incomprehensible by including every specific demand?

You know the answer to this question: it is obviously no.

> Will regroup be interesting to other users?

Yes, it is a very common need, at least in our institution.
I stumbled on the need with a slight variation: amg had to regroup "parts" of some combined axes. I did two different versions to solve her problem: a more limited but more efficient one, and a more general but less efficient one. The limited one handles only prefixes (i.e. the first part of the combined axis). The second one works for any "part" of the combined axis, but splits the axis, does the aggregate, then recombines the axes.
def sum_prefixes(array, axis, prefixes, combined_prefix, sep='_'):
    axis = array.axes[axis]
    all_prefixes, suffixes = axis.split(sep=sep)
    starts_with_prefixes = axis.startingwith(prefixes[0])
    for prefix in prefixes[1:]:
        starts_with_prefixes = starts_with_prefixes.union(axis.startingwith(prefix))
    aggregated_groups = tuple(starts_with_prefixes.endingwith(s) >> f'{combined_prefix}{sep}{s}'
                              for s in suffixes)
    other_groups = tuple(axis[:].difference(starts_with_prefixes))
    return array.sum(aggregated_groups + other_groups)
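The core prefix-merging logic can also be sketched in plain Python (this is a hypothetical standalone function, not the larray version above): labels on the combined axis look like `<prefix><sep><suffix>`, and values whose prefix is in the given set are summed per suffix under the combined prefix.

```python
def sum_prefixes_plain(values, prefixes, combined_prefix, sep='_'):
    """values: ordered dict mapping combined label -> number."""
    out = {}
    for label, value in values.items():
        prefix, suffix = label.split(sep, 1)
        if prefix in prefixes:
            # merge this label into the combined prefix, keeping the suffix
            label = f'{combined_prefix}{sep}{suffix}'
        out[label] = out.get(label, 0) + value
    return out

labels = ['BR_A', 'BR_B', 'WA_B', 'WA_C', 'FL_C', 'FL_D', 'FR_A', 'DE_B']
values = dict(zip(labels, range(8)))
sum_prefixes_plain(values, {'BR', 'WA', 'FL'}, 'BE')
# -> {'BE_A': 0, 'BE_B': 3, 'BE_C': 7, 'BE_D': 5, 'FR_A': 6, 'DE_B': 7}
```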
def split_axes_sum(array, combined_axis, group, sep='_'):
    orig_combined_axis = array.axes[combined_axis]
    split_axes = orig_combined_axis.split(sep=sep)
    split_array = array.split_axes(combined_axis, sep=sep)
    split_axis = split_array.axes[group.axis]
    nans = isnan(split_array)
    added_labels = nans[nans].axes[combined_axis]
    agg_array = split_array.sum((group,) + tuple(split_axis[:].difference(group)))
    combined_array = agg_array.combine_axes(split_axes)
    new_combined_axis = combined_array.axes[combined_axis]
    return combined_array.drop(added_labels.intersection(new_combined_axis))
>>> arr = ndtest('a_b=BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B')
>>> sum_prefixes(arr, 'a_b', ['BR', 'WA', 'FL'], 'BE')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> split_axes_sum(arr, 'a_b', X.a['BR, WA, FL'] >> 'BE')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> split_axes_sum(arr, 'a_b', X.b['B, C'] >> 'BC')
a_b BR_BC BR_A WA_BC FL_BC FL_D FR_BC FR_A DE_BC
1.0 0.0 5.0 4.0 5.0 0.0 6.0 7.0
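For illustration, the general "merge any part of the combined label" case can be sketched in plain Python without splitting and recombining axes (hypothetical function, not larray code). Because labels are rewritten in place, no nan-filled combinations are ever created, which sidesteps the spurious all-nan labels (like FR_BC above) that the split/recombine hack has to drop.

```python
def part_sum(values, members, new_name, part=0, sep='_'):
    """Sum values whose `part`-th label component is in `members`,
    renaming that component to `new_name`."""
    out = {}
    for label, value in values.items():
        parts = label.split(sep)
        if parts[part] in members:
            parts[part] = new_name
        new_label = sep.join(parts)
        out[new_label] = out.get(new_label, 0) + value
    return out

labels = ['BR_A', 'BR_B', 'WA_B', 'WA_C', 'FL_C', 'FL_D', 'FR_A', 'DE_B']
values = dict(zip(labels, range(8)))
part_sum(values, {'B', 'C'}, 'BC', part=1)
# -> {'BR_A': 0, 'BR_BC': 1, 'WA_BC': 5, 'FL_BC': 4, 'FL_D': 5, 'FR_A': 6, 'DE_BC': 7}
```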
This could, one day, be solved via some kind of pattern syntax, but it is hard to imagine something powerful enough that is still readable:
>>> arr.sum('a_b[BR_{prod:*}, WA_{prod:*}, FL_{prod:*}] >> BE_{prod}')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> arr.sum('a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> arr.sum('a_b[(BR|WA|FL)_*] >> BE_*')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
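One way to prototype that pattern idea without inventing new syntax is an ordinary regex rewrite of the labels followed by summing equal labels (the function name and API here are hypothetical, not a larray proposal):

```python
import re

def pattern_sum(values, pattern, replacement):
    """Rewrite each label with re.sub, then sum values sharing a label."""
    out = {}
    for label, value in values.items():
        new_label = re.sub(pattern, replacement, label)
        out[new_label] = out.get(new_label, 0) + value
    return out

labels = ['BR_A', 'BR_B', 'WA_B', 'WA_C', 'FL_C', 'FL_D', 'FR_A', 'DE_B']
values = dict(zip(labels, range(8)))
pattern_sum(values, r'^(?:BR|WA|FL)_(.*)$', r'BE_\1')
# -> {'BE_A': 0, 'BE_B': 3, 'BE_C': 7, 'BE_D': 5, 'FR_A': 6, 'DE_B': 7}
```

A real syntax would still need to decide how capture groups map to the new label, which is essentially what the `{prod}` placeholders above try to express.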
Implement an easier way to aggregate only part of an axis and leave other labels intact.