PERF: short-circuit (left == right).all() comparisons

jbrockmendel commented 4 years ago

In places like equals methods and array_equivalent, we do things like (left == right).all() or ((left == right) | (isna(left) & isna(right))).all(). For large arrays that are not equal, we can do much better with something like:

def all_match(left, right) -> bool:
    if left.dtype.kind != "i":
        # viewing as i8 will make NaNs be treated as equal
        return _all_match_i8(left.view("i8"), right.view("i8"))

    return _all_match_i8(left, right)

cdef bint _all_match_i8(const int64_t[:] left, const int64_t[:] right):
    cdef:
        Py_ssize_t i, n = len(left)

    for i in range(n):
        if left[i] != right[i]:
            return False

    return True

Some profiling results:

In [2]: arr = np.arange(10**6)                                                                                                                                                      
In [3]: arr2 = arr.copy()                                                                                                                                                           
In [4]: arr2[0] = -1                                                                                                                                                                

In [5]: %timeit np.array_equal(arr, arr2)                                                                                                                                           
831 µs ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit all_match(arr, arr2)                                                                                                                                                
1.27 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit np.array_equal(arr, arr)                                                                                                                                            
416 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit all_match(arr, arr)                                                                                                                                                 
812 µs ± 5.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So in cases that short circuit early, we can get massive speedups, but this implementation is actually 2x slower in cases that dont short-circuit (for reasons that are not clear to me).

jbrockmendel commented 4 years ago

cc @seberg any idea why this is so much slower in the not-short-circuit case?

seberg commented 4 years ago

@jbrockmendel my guess is: We are using SIMD loops in this case (at least sometimes), and cython probably does not manage to do that.

Are you sure what you are doing is a good idea though? First, -0 and 0 are not different normally. Second, there are many many NaNs, and you are making some of them return True.

jbrockmendel commented 4 years ago

We are using SIMD loops in this case (at least sometimes), and cython probably does not manage to do that.

@scoder thoughts?

scoder commented 4 years ago

SIMD

Might make a difference here, yes. Make sure your CFLAGS allow for auto-vectorisation, and that your C compiler is modern enough to figure it out for you.

It sometimes helps the compiler to make your algorithm redundant, i.e. instead of just a[i] != b[i] make sure the arrays are long enough, take two steps at once and write a[i] != b[i] or a[i+1] != b[i+1], or even using the uglier | instead of or (try to avoid that if you don't need it). That way, the code makes it clear to the C compiler that it can safely read ahead far enough to use wider SIMD instructions.

pandas-dev / pandas

PERF: short-circuit (left == right).all() comparisons #32339