numba / numba

NumPy aware dynamic Python compiler using LLVM
https://numba.pydata.org/
BSD 2-Clause "Simplified" License

Speedups for jitted reductions #2176

Open ml31415 opened 8 years ago

ml31415 commented 8 years ago

I was playing around with code like this:

import numpy as np
import numba as nb
import bottleneck as bn
import numbagg as nbg

@nb.njit
def nanmin_demo(x):
    if x.size == 0:
        raise ValueError("nanmin(): empty array")
    ret = np.nan
    for i, x_ in enumerate(x.flat):
        if not np.isnan(x_):
            ret = x_
            break
    if np.isnan(ret):
        return ret

    # This should be x.flat[i:], so that the array is
    # not iterated again unnecessarily. Better ideas?
    for x_ in x.flat:
        if x_ < ret:
            ret = x_
    return ret

@nb.njit
def nanmin_numba(a):
    if a.size == 0:
        raise ValueError("nanmin(): empty array")
    for view in np.nditer(a):
        minval = view.item()
        break
    for view in np.nditer(a):
        v = view.item()
        if not minval < v and not np.isnan(v):
            minval = v
    return minval

@nb.njit
def nanmin_numbagg_1dim(a):
    amin = np.inf  # np.infty was removed in NumPy 2.0
    all_missing = 1
    for ai in a.flat:
        if ai <= amin:
            amin = ai
            all_missing = 0
    if all_missing:
        amin = np.nan
    return amin

impls = [nanmin_demo, nanmin_numba, nanmin_numbagg_1dim, nbg.nanmin, bn.nanmin]

for i in range(4):
    x = np.random.random(100000)
    x[x>0.3334*i] = np.nan
    res = np.array([impl(x) for impl in impls])
    assert np.all(res[0] == res) or np.all(np.isnan(res))
    x = x.reshape((-1, 100))  # reshape returns a new view; the result must be kept
    res = np.array([impl(x) for impl in impls])
    assert np.all(res[0] == res) or np.all(np.isnan(res))
    for impl in impls:
        %timeit impl(x)
    print('--------')
print(nb.__version__)

nanmin_impl is the currently implemented overload. It is compared with the built-in NumPy version, bottleneck, and another experimental implementation. The timings look like this for me:

10000 loops, best of 3: 63.5 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 62.8 µs per loop
10000 loops, best of 3: 151 µs per loop
10000 loops, best of 3: 68.4 µs per loop
--------
10000 loops, best of 3: 50.9 µs per loop
1000 loops, best of 3: 405 µs per loop
10000 loops, best of 3: 57.3 µs per loop
10000 loops, best of 3: 151 µs per loop
10000 loops, best of 3: 68.4 µs per loop
--------
10000 loops, best of 3: 51.1 µs per loop
1000 loops, best of 3: 409 µs per loop
10000 loops, best of 3: 57.4 µs per loop
10000 loops, best of 3: 151 µs per loop
10000 loops, best of 3: 68.5 µs per loop
--------
10000 loops, best of 3: 51.6 µs per loop
10000 loops, best of 3: 126 µs per loop
10000 loops, best of 3: 57.5 µs per loop
10000 loops, best of 3: 151 µs per loop
10000 loops, best of 3: 68.5 µs per loop
--------
0.28.1
10000 loops, best of 3: 59.4 µs per loop
10000 loops, best of 3: 153 µs per loop
10000 loops, best of 3: 192 µs per loop
1000 loops, best of 3: 207 µs per loop
10000 loops, best of 3: 68.5 µs per loop
--------
10000 loops, best of 3: 102 µs per loop
1000 loops, best of 3: 406 µs per loop
10000 loops, best of 3: 192 µs per loop
1000 loops, best of 3: 207 µs per loop
10000 loops, best of 3: 68.4 µs per loop
--------
10000 loops, best of 3: 102 µs per loop
1000 loops, best of 3: 408 µs per loop
10000 loops, best of 3: 192 µs per loop
1000 loops, best of 3: 208 µs per loop
10000 loops, best of 3: 68.4 µs per loop
--------
10000 loops, best of 3: 102 µs per loop
10000 loops, best of 3: 126 µs per loop
10000 loops, best of 3: 192 µs per loop
1000 loops, best of 3: 206 µs per loop
10000 loops, best of 3: 68.6 µs per loop
--------
0.29.0

As you can see, in some unlucky cases the current implementation is 8x slower than nanmin_demo. About half of this speedup comes from using x.flat instead of np.nditer. I'm actually not sure if there are good reasons to use np.nditer instead of x.flat. If so, I'd be curious to learn them, because it would also affect a bunch of code that I wrote for numpy_groupies.
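To make the comparison concrete, here is an un-jitted sketch of the two iteration patterns being discussed (in practice each function would carry an @nb.njit decorator; the function names are illustrative). x.flat walks elements in logical C order, while np.nditer yields 0-d views that need .item():

```python
import numpy as np

# Two jittable iteration patterns over the same data (shown un-jitted).
def sum_flat(x):
    total = 0.0
    for v in x.flat:          # yields scalars, logical C order
        total += v
    return total

def sum_nditer(x):
    total = 0.0
    for view in np.nditer(x):  # yields 0-d views, possibly memory order
        total += view.item()
    return total

x = np.arange(6.0).reshape(2, 3)
assert sum_flat(x) == sum_nditer(x) == 15.0
```

Both produce the same result for order-insensitive reductions; the question in this thread is purely which one the compiler turns into a tighter loop.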

The other major speedup seems to come from some unlucky branching when the if condition is fed more than a simple nan-check or comparison. The jitted code seems to suffer much more from compound if-statements than ordinary C code would. Any ideas why that is?

If it is actually valid to use the flatiter for these cases, I'd go ahead and put some more optimizations together for a pull request.

gdementen commented 7 years ago

You might want to have a look at https://github.com/shoyer/numbagg

pitrou commented 7 years ago

Well, nanmin_impl() doesn't use the same algorithm as nanmin(), so you may be comparing apples to oranges here.

ml31415 commented 7 years ago

@gdementen Thanks for the note, I added it to the benchmark. Looks like numbagg doesn't use faster functions on flat arrays due to #1087.

One more strange thing: This benchmark was taken with 0.28.1. When I upgraded to 0.29.0, the timings got up to around 100µs for nanmin. The numbagg implementation also suffers a lot. I added the timings above. Any ideas on that?

@pitrou I just make sure to get a non-nan value beforehand, so that I can rely on the comparison evaluating to False whenever I hit a nan within the actual loop, and don't have to check for nan explicitly over and over again. In the end it does much the same thing, I guess.
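The two-pass trick described here can be sketched un-jitted as follows (in the benchmark above this is nanmin_demo, decorated with @nb.njit):

```python
import numpy as np

# Two-pass nanmin: seed with the first non-nan value, then exploit the fact
# that any comparison with nan is False, so the hot loop needs no isnan check.
def nanmin_two_pass(x):
    ret = np.nan
    for x_ in x.flat:          # pass 1: find a non-nan seed
        if not np.isnan(x_):
            ret = x_
            break
    if np.isnan(ret):          # all-nan (or empty) input
        return ret
    for x_ in x.flat:          # pass 2: `nan < ret` is always False
        if x_ < ret:
            ret = x_
    return ret

assert nanmin_two_pass(np.array([np.nan, 3.0, 1.0, np.nan])) == 1.0
assert np.isnan(nanmin_two_pass(np.array([np.nan, np.nan])))
```

The cost is that the elements before the seed are visited twice, which the comment in the original snippet already flags.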

ml31415 commented 7 years ago

I guess we should maybe split this up into the speed regression on one hand, and further improvements on top of that on the other.

pitrou commented 7 years ago

In the end it does quite the same I guess.

My point is that you cannot compare nditer() and flat performance if you don't use them in the exact same way. Ideally nditer() should always be as fast as flat (and faster for non-C arrays, since it is allowed to walk the array in non-logical order).

ml31415 commented 7 years ago

My main question is: is there any important reason not to simply iterate over the plain array? All the functions currently treat the input array as 1d anyway, so it wouldn't matter at all in which order the array is iterated. Using nditer imho just adds some overhead in that case. I extended the benchmarking a bit in this gist. Here are the current results:

0.28.1
                  numpy     numba     new      
---------------------------------------------
nanmin    nans    0.04066   0.05641   0.03758  
nanmin    nonans  0.04073   0.05690   0.03318  
---------------------------------------------
nanmax    nans    0.03982   0.06563   0.03342  
nanmax    nonans  0.03981   0.05731   0.03394  
---------------------------------------------
nansum    nans    0.05121   0.05705   0.03537  
nansum    nonans  0.04213   0.05773   0.03382  
---------------------------------------------
nanmean   nans    0.04621   0.05895   0.03313  
nanmean   nonans  0.05358   0.06331   0.06475  
---------------------------------------------
nanvar    nans    0.06668   0.09916   0.04327  
nanvar    nonans  0.07438   0.09794   0.05085  
---------------------------------------------
nanstd    nans    0.04514   0.05655   0.03473  
nanstd    nonans  0.04121   0.05822   0.03327  
---------------------------------------------
nanmedian nans    0.04447   0.05676  
nanmedian nonans  0.03985   0.05692  
---------------------------------------------
all       int     0.05230   0.05760   0.03325  
all       float   0.04987   0.06085   0.03349  
---------------------------------------------
any       int     0.05157   0.05700   0.03399  
any       float   0.04824   0.05727   0.03319 

As the table shows, notable speedups seem possible in nearly all cases.
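A minimal stand-in for the kind of harness behind such a table might look like the following (the gist itself is not reproduced here; the helper name and the choice of timeit are illustrative only):

```python
import timeit
import numpy as np

# Time each implementation on the same input; returns seconds per call.
def bench(impls, x, number=100):
    return {f.__name__: timeit.timeit(lambda f=f: f(x), number=number) / number
            for f in impls}

x = np.random.random(100_000)
x_nans = x.copy()
x_nans[x_nans > 0.5] = np.nan    # "nans" row: roughly half the values missing

times = bench([np.nanmin, np.nanmax, np.nansum], x_nans)
assert set(times) == {"nanmin", "nanmax", "nansum"}
assert all(t > 0 for t in times.values())
```

Note that %timeit in the original script is an IPython magic; timeit.timeit is the plain-Python equivalent.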

ml31415 commented 7 years ago

As these functions are real bread-and-butter number-crunching tools, I guess it really makes sense to have them as fast as possible for every user. I'd volunteer to do the legwork for the speedups, though it would be nice to have a definitive answer about the use of nditer first.

gdementen commented 7 years ago

I never even looked at the implementation of these functions in numba, but nditer might be used to reuse the same code for the case where the axis argument is provided. Just a thought...

pitrou commented 7 years ago

@ml31415, please try your benchmark with Fortran-ordered arrays.

ml31415 commented 7 years ago

@pitrou I understand that the flatiter has to traverse Fortran-ordered arrays against their memory layout in that case, and that this costs performance. But nditer also seems to bring some overhead, which I'd like to avoid for simple 1d cases. What I'd be looking for is a way to read the array as-is from memory, in whichever order it comes; a+b == b+a in the end. For 1d arrays I can simply iterate `for x in y`, without .flat, ravel() or nditer(), but for nd-arrays the trouble starts. One option would be to use different functions for 1d/nd, but I had hoped for better suggestions to force the array to be seen as 1d and iterated as such.
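For contiguous arrays, "read the array as-is from memory" can be had with a no-copy reshape; only genuinely strided views would need a copy. A hedged sketch (the helper name is hypothetical, and this is plain NumPy, not a numba typed-dispatch implementation):

```python
import numpy as np

# View any array as flat 1-D without copying when the layout allows it.
# Valid only for order-insensitive reductions (a+b == b+a).
def as_flat_1d(a):
    if a.flags.c_contiguous:
        return a.reshape(-1)
    if a.flags.f_contiguous:
        return a.T.reshape(-1)   # memory order, not logical order
    return a.ravel()             # strided view: may copy

a = np.asfortranarray(np.arange(6.0).reshape(2, 3))
flat = as_flat_1d(a)
assert flat.base is not None                          # a view, no copy
assert flat.tolist() == [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]  # memory order
assert np.isclose(flat.sum(), a.sum())
```

Inside an @overload this dispatch would happen at compile time on the array type, so the jitted loop itself only ever sees a 1d contiguous operand.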

pitrou commented 7 years ago

What I'd be looking for is a way to read the array as is from memory

Well, that's exactly what nditer() does :-) If there is a regression in nditer() performance it should be investigated (perhaps @stuartarchibald wants to take a look).

ml31415 commented 7 years ago

Hmm, for some loops nditer() actually comes quite close to plain iteration speed, though for some others it clearly falls behind. I suppose it may be compiler optimizations that sometimes kick in and sometimes don't. What I had imagined is something like this:

def get_flatiter(a):
    # Dispatch on the array *type* at compile time; note the consumer may
    # need .item() in the nditer case, since it yields 0-d views.
    if a.ndim > 1:
        @register_jitable
        def flatiter(arr):
            return np.nditer(arr)
    else:
        @register_jitable
        def flatiter(arr):
            return arr
    return flatiter

@overload(np.nansum)
def np_nansum(a):
    if not isinstance(a, types.Array):
        return
    if isinstance(a.dtype, types.Integer):
        retty = types.intp
    else:
        retty = a.dtype
    zero = retty(0)
    isnan = get_isnan(a.dtype)
    flatiter = get_flatiter(a)

    def nansum_impl(arr):
        c = zero
        for v in flatiter(arr):
            if not isnan(v):
                c += v
        return c

    return nansum_impl
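The dispatch idea above can be illustrated un-jitted in plain Python (names here are illustrative; in numba the choice would be made once at compile time inside the @overload, not at call time):

```python
import numpy as np

# Pick the iteration strategy once, based on dimensionality,
# instead of branching per element.
def make_nansum(ndim):
    if ndim > 1:
        def body(arr):
            c = 0.0
            for view in np.nditer(arr):   # nd case: 0-d views
                v = view.item()
                if not np.isnan(v):
                    c += v
            return c
    else:
        def body(arr):
            c = 0.0
            for v in arr:                 # 1d case: plain iteration
                if not np.isnan(v):
                    c += v
            return c
    return body

a = np.array([1.0, np.nan, 2.0])
assert make_nansum(a.ndim)(a) == 3.0
b = np.array([[1.0, np.nan], [2.0, 4.0]])
assert make_nansum(b.ndim)(b) == 7.0
```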

stuartarchibald commented 7 years ago

I'll take a look when I next have some spare cycles. Something appears to have caused a performance regression in https://github.com/numba/numba/issues/2196, and I guess potentially here too. Looking at the optimised asm dump from #2196: 1) there are a lot more instructions present, and 2) the selected instructions seem generally wider, even though the LLVM backend wasn't updated as far as I'm aware. As discussed with @pitrou offline, I'll do a bisect when I get a chance.

stuartarchibald commented 7 years ago

Finally got around to bisecting this.

d6d14642f9642ef4337e5034ff0b9ffe29fb53ba is the first bad commit

which is d6d1464 from PR #2050