ml31415 / numpy-groupies

Optimised tools for group-indexing operations: aggregated sum and more
BSD 2-Clause "Simplified" License
194 stars 20 forks source link
groupby numba numpy python

GitHub Workflow CI Status PyPI Conda-forge Python Version from PEP 621 TOML PyPI - Downloads

numpy-groupies

This package consists of a small library of optimised tools for doing things that can roughly be considered "group-indexing operations". The most prominent tool is aggregate, which is described in detail further down the page.

Installation

If you have pip, then simply:

pip install numpy_groupies

Note that numpy_groupies doesn't have any compulsory dependencies (even numpy is optional) so you should be able to install it fairly easily even without a package manager. If you just want one particular implementation of aggregate (e.g. aggregate_numpy.py), you can download that one file, and copy-paste the contents of utils.py into the top of that file (replacing the from .utils import (...) line).

aggregate

aggregate_diagram

import numpy as np
import numpy_groupies as npg
group_idx = np.array([   3,   0,   0,   1,   0,   3,   5,   5,   0,   4])
a =         np.array([13.2, 3.5, 3.5,-8.2, 3.0,13.4,99.2,-7.1, 0.0,53.7])
npg.aggregate(group_idx, a, func='sum', fill_value=0)
# >>>          array([10.0, -8.2, 0.0, 26.6, 53.7, 92.1])

aggregate takes an array of values, and an array giving the group number for each of those values. It then returns the sum (or mean, or std, or any, ...etc.) of the values in each group. You have probably come across this idea before - see Matlab's accumarray function, or pandas groupby concept, or MapReduce paradigm, or simply the basic histogram.

A couple of implemented functions do not reduce the data, instead it calculates values cumulatively while iterating over the data or permutates them. The output size matches the input size.

group_idx = np.array([4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1])
a =         np.array([3, 4, 1, 3, 9, 9, 6, 7, 7, 0, 8, 2, 1, 8, 9, 8])
npg.aggregate(group_idx, a, func='cumsum')
# >>>          array([3, 4, 5, 6,15, 9,15,22, 7, 0,15,17, 6,14,31,39])

Inputs

The function accepts various different combinations of inputs, producing various different shapes of output. We give a brief description of the general meaning of the inputs and then go over the different combinations in more detail:

aggregate_dims_diagram

Note on performance. The order of the output is unlikely to affect performance of aggregate (although it may affect your downstream usage of that output), however the order of multidimensional a or group_idx can affect performance: in Form 4 it is best if columns are contiguous in memory within group_idx, i.e. group_idx[:, 99] corresponds to a contiguous chunk of memory; in Form 3 it's best if all the data in a for group_idx[i] is contiguous, e.g. if axis=1 then we want a[:, 55] to be contiguous.

Available functions

By default, aggregate assumes you want to sum the values within each group, however you can specify another function using the func kwarg. This func can be any custom callable, however you will likely want one of the following optimized functions. Note that not all functions might be provided by all implementations.

The above functions also have a nan-form, which skip the nan values instead of propagating them to the result of the calculation:

The following functions are slightly different in that they always return boolean values. Their treatment of nans is also different from above:

The following functions don't reduce the data, but instead produce an output matching the size of the input:

Finally, there are three functions which don't reduce each group to a single value, instead they return the full set of items within the group:

Examples

Compute sums of consecutive integers, and then compute products of those consecutive integers.

group_idx = np.arange(5).repeat(3)
# group_idx: array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
a = np.arange(group_idx.size)
# a: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
x = npg.aggregate(group_idx, a) # sum is default
# x: array([ 3, 12, 21, 30, 39])
x = npg.aggregate(group_idx, a, 'prod')
# x: array([ 0, 60, 336, 990, 2184])

Get variance ignoring nans, setting all-nan groups to nan.

x = npg.aggregate(group_idx, a, func='nanvar', fill_value=nan)

Count the number of elements in each group. Note that this is equivalent to doing np.bincount(group_idx), indeed that is how the numpy implementation does it.

x = npg.aggregate(group_idx, 1)

Sum 1000 values into a three-dimensional cube of size 15x15x15. Note that in this example all three dimensions have the same size, but that doesn't have to be the case.

group_idx = np.random.randint(0, 15, size=(3, 1000))
a = np.random.random(group_idx.shape[1])
x = npg.aggregate(group_idx, a, func="sum", size=(15,15,15), order="F")
# x.shape: (15, 15, 15)
# np.isfortran(x): True

Use a custom function to generate some strings.

group_idx = np.array([1, 0,  1,  4,  1])
a = np.array([12.0, 3.2, -15, 88, 12.9])
x = npg.aggregate(group_idx, a,
              func=lambda g: ' or maybe '.join(str(gg) for gg in g), fill_value='')
# x: ['3.2', '12.0 or maybe -15.0 or maybe 12.9', '', '', '88.0']

Use the axis arg in order to do a sum-aggregation on three rows simultaneously.

a = np.array([[99, 2,  11, 14,  20],
           [33, 76, 12, 100, 71],
           [67, 10, -8, 1,   9]])
group_idx = np.array([[3, 3, 7, 0, 0]])
x = npg.aggregate(group_idx, a, axis=1)
# x : [[ 34, 0, 0, 101, 0, 0, 0, 11],
#      [171, 0, 0, 109, 0, 0, 0, 12],
#      [ 10, 0, 0,  77, 0, 0, 0, -8]]

Multiple implementations

There are multiple implementations of aggregate provided. If you use from numpy_groupies import aggregate, the best available implementation will automatically be selected. Otherwise you can pick a specific version directly like from numpy_groupies import aggregate_nb as aggregate or by importing aggregate from the implementing module from numpy_groupies.aggregate_weave import aggregate.

Currently the following implementations exist:

All implementations have the same calling syntax and produce the same outputs, to within some floating-point error. However some implementations only support a subset of the valid inputs and will sometimes throw NotImplementedError.

Benchmarks

Scripts for testing and benchmarking are included in this repository. For benchmarking, run python -m numpy_groupies.benchmarks.generic from the root of this repository.

Below we are using 500,000 indices uniformly picked from [0, 1000). The values of a are uniformly picked from the interval [0,1), with anything less than 0.2 then set to 0 (in order to serve as falsy values in boolean operations). For nan- operations another 20% of the values are set to nan, leaving the remainder on the interval [0.2,0.8).

The benchmarking results are given in ms for an i7-7560U running at 2.40GHz:

function ufunc numpy numba pandas
sum 1.950 1.728 0.708 11.832
prod 2.279 2.349 0.709 11.649
min 2.472 2.489 0.716 11.686
max 2.457 2.480 0.745 11.598
len 1.481 1.270 0.635 10.932
all 37.186 3.054 0.892 12.587
any 35.278 5.157 0.890 12.845
anynan 5.783 2.126 0.762 144.740
allnan 7.971 4.367 0.774 144.507
mean ---- 2.500 0.825 13.284
std ---- 4.528 0.965 12.193
var ---- 4.269 0.969 12.657
first ---- 1.847 0.811 11.584
last ---- 1.309 0.581 11.842
argmax ---- 3.504 1.411 293.640
argmin ---- 6.996 1.347 290.977
nansum ---- 5.388 1.569 15.239
nanprod ---- 5.707 1.546 15.004
nanmin ---- 5.831 1.700 14.292
nanmax ---- 5.847 1.731 14.927
nanlen ---- 3.170 1.529 14.529
nanall ---- 6.499 1.640 15.931
nanany ---- 8.041 1.656 15.839
nanmean ---- 5.636 1.583 15.185
nanvar ---- 7.514 1.682 15.643
nanstd ---- 7.292 1.666 15.104
nanfirst ---- 5.318 2.096 14.432
nanlast ---- 4.943 1.473 14.637
nanargmin ---- 7.977 1.779 298.911
nanargmax ---- 5.869 1.802 301.022
cumsum ---- 71.713 1.119 8.864
cumprod ---- ---- 1.123 12.100
cummax ---- ---- 1.062 12.133
cummin ---- ---- 0.973 11.908
arbitrary ---- 147.853 46.690 129.779
sort ---- 167.699 ---- ----

_Linux(x8664), Python 3.10.12, Numpy 1.25.2, Numba 0.58.0, Pandas 2.0.2

Development

This project was started by @ml31415 and the numba and weave implementations are by him. The pure python and numpy implementations were written by @d1manson.

The authors hope that numpy's ufunc.at methods or some other implementation of aggregate within numpy or scipy will eventually be fast enough, to make this package redundant. Numpy 1.25 actually contained major improvements on ufunc speed, which reduced the speed gap between numpy and the numba implementation a lot.