pydata / bottleneck

Fast NumPy array functions written in C
BSD 2-Clause "Simplified" License
1.05k stars 101 forks source link

Improved slow sum-of-squares #189

Open corrado9999 opened 6 years ago

corrado9999 commented 6 years ago

It looks like a job for np.einsum

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    output_subscripts = list(set(input_subscripts) - set([axis])) \
                        if axis is not None \
                        else []
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y

This does not handle NaNs, but they are not handled by the current version either. Actually, I have to admit that, on my machine, this version is twice faster than the accelerated version on 1000x1000 arrays, but I do not know how much general this result can be.

kwgoodman commented 6 years ago

Wow! That is amazing.

The current version is just:

def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    y = np.multiply(a, a).sum(axis)
    return y

The slow version serves several purposes: unit testing (I prefer the current simple one for that); benchmarking (I assume, but could be wrong, that most users would do something like a*a as in the current version); handle unsupported dtypes (rare).

The bottleneck version is much faster for small arrays. For large arrays I wonder if there is C API access to einsum (not that I would code it up). At the least I should add in the docstring that np.einsum is much faster for large 2d arrays.

kwgoodman commented 6 years ago

Alternatively we could use your version. It currently doesn't handle negative axes.

kwgoodman commented 6 years ago

This seems to work (passes all bottleneck unit tests):


def ss(a, axis=None):
    "Slow sum of squares used for unaccelerated dtypes."
    a = np.asarray(a)
    input_subscripts = range(a.ndim)
    if axis is None:
        output_subscripts = []
    else:
        if axis < 0:
            axis = a.ndim + axis
            if axis < 0:
                raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
    y = np.einsum(a, input_subscripts, a, input_subscripts, output_subscripts)
    return y
corrado9999 commented 6 years ago

Great! What about nansum? We have to handle NaNs by copying the array.

def nansum(a, axis=None):
    "Slow nansum function used for unaccelerated dtype."
    a = np.asarray(a)
    mask = np.isnan(a)
    if mask.any():
        a = a.copy()
        a[mask] = 0
    input_subscripts = range(a.ndim)
    if axis is None:
        y = np.einsum(a, input_subscripts)
    else:
        if axis < 0:
            axis = a.ndim + axis
            if axis < 0:
                raise ValueError("`axis` is out of range")
        output_subscripts = list(set(input_subscripts) - set([axis]))
        y = np.einsum(a, input_subscripts, output_subscripts)
    return y

Please, note how I managed differently the "raveled" case. This is because I noted a severe performance drop when passing an empty list as output subscripts. Here are some timings on my machine:

In [1]: import bottleneck, numpy as np

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def nansum(a, axis=None):
:    "Slow nansum function used for unaccelerated dtype."
:    a = np.asarray(a)
:    mask = np.isnan(a)
:    if mask.any():
:        a = a.copy()
:        a[mask] = 0
:    input_subscripts = range(a.ndim)
:    if axis is None:
:        y = np.einsum(a, input_subscripts)
:    else:
:        if axis < 0:
:            axis = a.ndim + axis
:            if axis < 0:
:                raise ValueError("`axis` is out of range")
:        output_subscripts = list(set(input_subscripts) - set([axis]))
:        y = np.einsum(a, input_subscripts, output_subscripts)
:    return y
:<EOF>

In [3]: a = np.random.rand(1000,1000)

In [4]: %timeit nansum(a)
1000 loops, best of 3: 460 µs per loop

In [5]: %timeit nansum(a, 0)
1000 loops, best of 3: 863 µs per loop

In [6]: %timeit nansum(a, 1)
1000 loops, best of 3: 890 µs per loop

In [7]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 876 µs per loop

In [8]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop

In [9]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 882 µs per loop

In [10]: a[30,40] = np.nan

In [11]: %timeit nansum(a)
1000 loops, best of 3: 1.32 ms per loop

In [12]: %timeit nansum(a, 0)
100 loops, best of 3: 1.91 ms per loop

In [13]: %timeit nansum(a, 1)
100 loops, best of 3: 1.83 ms per loop

In [14]: %timeit bottleneck.nansum(a)
1000 loops, best of 3: 875 µs per loop

In [15]: %timeit bottleneck.nansum(a, 0)
1000 loops, best of 3: 1.1 ms per loop

In [16]: %timeit bottleneck.nansum(a, 1)
1000 loops, best of 3: 883 µs per loop
kwgoodman commented 6 years ago

I think the case for changing bn.slow.nansum is weaker than that for bn.slow.ss. For example, many users of nansum will know ahead of time if the array contains nans.

kwgoodman commented 6 years ago

How does the speed of your einsum-based nansum compare to bn.slow.nansum which is just np.nansum?