numpy / numpy


A+B, axpy #15056

Closed nschloe closed 4 years ago

nschloe commented 4 years ago

The first Google hit for "numpy parallel" is this document, where it says:

Finally, scipy/numpy does not parallelize operations like

>>> A = B + C
>>> A = numpy.sin(B)
>>> A = scipy.stats.norm.isf(B)

These operations run sequentially, taking no advantage of multicore machines (but see below). In principle, this could be changed without too much work.

I'm particularly interested in parallelizing simple arithmetic operations like A+B, A+=B, A*B, etc. I'm looking at BLAS's axpy, of course.

Has any work been done in this area yet? At which part of the code would I have to look?
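For reference, axpy is the BLAS level-1 update y ← a·x + y; in plain NumPy terms, the kind of operation I mean is just the following (a minimal sketch, names illustrative):

import numpy as np

a = 2.0
x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# BLAS axpy semantics: y <- a * x + y ("a times x plus y").
# Today this runs as a single-threaded NumPy loop, with a temporary for a * x.
y = a * x + y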

rgommers commented 4 years ago

I believe the Intel modifications to NumPy, using VML and OpenMP, that they're shipping with the Intel Python distribution are doing some of this.

Note that this is not the best idea to do without runtime control, as hinted at by one of the next sentences from the paragraph you quoted:

Of course, in reality one would want to have some runtime control

By default, library code should always run single-threaded. OpenBLAS and MKL multithreading commonly messes things up when users want to use multiprocessing or another level of multithreading (e.g. with Dask, or with n_jobs in scikit-learn).

So I think there's not much point in doing work on this without thinking about an API for giving users control over the behavior. OpenMP is pretty terrible; it doesn't really make sense unless you build a whole stack/application from source for your own use, which is not the case when you distribute a library as wheels or conda packages. Things like n_jobs=1 in scikit-learn and workers=1 in SciPy are nice, but it's hard to give + a keyword ....
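To give a concrete picture of that kind of runtime control: threadpoolctl already provides it for the BLAS threading underneath NumPy and SciPy. A minimal sketch, assuming OpenBLAS or MKL is the installed BLAS:

import numpy as np
from threadpoolctl import threadpool_limits

A = np.random.rand(2000, 2000)
B = np.random.rand(2000, 2000)

# Restrict the BLAS thread pool for this block only, e.g. so an outer
# multiprocessing / Dask / n_jobs layer keeps control of the cores.
with threadpool_limits(limits=1, user_api="blas"):
    C = A @ B  # runs on a single BLAS thread inside this context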

mattip commented 4 years ago

The problem with this approach is that it leads to poorly optimized code: it is too low level. The churn produced by fanning simple operations out to multiple cores, bringing the result back to a single point, and then fanning out the next operation is quite hard to manage, and it sometimes leads to actual regressions in performance, as @rgommers points out.

Rather than spending time and effort optimizing primitives via multicore calls, you might want to put effort into looking at higher-level graphs of functions and optimizing those, perhaps using multicore methods but also by eliminating temporaries, fusing loops, lazy evaluation, and a host of other techniques. This is the approach taken in projects like Numba, PyPy, Dask, and numexpr (those just in the Python scientific data stack), Pythran and xtensor in the C++ world, and things like JavaScript JITs.
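As one concrete example of that higher-level approach, numexpr fuses an expression like y + b * x into a single blocked, multithreaded pass and avoids the full-size temporary. A minimal sketch, assuming numexpr is installed:

import numpy as np
import numexpr as ne

b = 5.0
x = np.random.rand(10_000_000)
y = np.random.rand(10_000_000)

# Plain NumPy: b * x is materialized as a full temporary, then added to y.
out_np = y + b * x

# numexpr: the whole expression is compiled and evaluated in one blocked,
# multithreaded loop, with no full-size temporary.
out_ne = ne.evaluate("y + b * x")

assert np.allclose(out_np, out_ne)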

mattip commented 4 years ago

Perhaps the warning at the top of those archived pages should be larger :smile:

rgommers commented 4 years ago

Nice comment @mattip, thanks. I think if the question is "how do we improve performance of NumPy code", then there are still some things to do within NumPy (like the ongoing SIMD work, reducing the overhead of ufuncs further, etc.) to make single-threaded performance better. And then there are a lot of things that can be done in other tools. I'd personally like to see Numba, Pythran and Transonic get more support, so we can use Transonic as the API and switch between JIT and ahead-of-time compilation.
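For a rough sketch of what the JIT route looks like in practice (Numba here; Pythran- or Transonic-decorated code would be similar), an explicitly parallel fused loop for this kind of update could be written as follows, assuming Numba is installed:

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def axpy_numba(a, x, y):
    # One fused, multithreaded pass: no temporaries; prange splits the
    # iteration space across cores.
    out = np.empty_like(y)
    for i in prange(y.shape[0]):
        out[i] = a * x[i] + y[i]
    return out

x = np.random.rand(10_000_000)
y = np.random.rand(10_000_000)
result = axpy_numba(2.0, x, y)  # first call pays the JIT compilation cost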

nschloe commented 4 years ago

Thanks everyone for the comments!

I understand that you don't want to mess too much with seemingly simple things like that. All of those arguments apply to numpy.dot as well, though, where we do use BLAS (and parallelization). I haven't seen any performance tests, but it must pay off there, right?

Would it perhaps be reasonable to add an .axpy() method (and perhaps others) to numpy arrays? That'd give the user full explicit control over when BLAS is used. I'd be interested in how that actually compares to x += a * y.

eric-wieser commented 4 years ago

Would it perhaps be reasonable to add an .axpy() method (and perhaps others) to numpy arrays?

This sounds like something that might be better explored as free functions / ufuncs in a new numpy_parallel module
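A purely hypothetical sketch of what such a free function might look like (neither a numpy_parallel module nor this signature exists today; it is only meant to illustrate the shape of the API):

import numpy as np

# Hypothetical sketch only: names and signature are illustrative.
def axpy(a, x, y, out=None):
    """Return a * x + y, writing into out if given."""
    if out is None:
        out = np.empty_like(y)
    np.multiply(x, a, out=out)   # reference, serial implementation;
    np.add(out, y, out=out)      # a real version would dispatch to a threaded C loop
    return out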

nschloe commented 4 years ago

Okay, so I noticed that scipy does expose axpy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.blas.daxpy.html). A speed comparison shows that axpy is never faster than the intrinsic x += a * y:

[plot: perfplot timings of axpy vs. intrinsics over array size]

Code to reproduce the plot:

import numpy as np
import scipy.linalg
import perfplot

b = 5.0

def intrinsics(data):
    x, y = data
    return y + b * x

def axpy(data):
    x, y = data
    yy = y.copy()  # daxpy overwrites y in place, so work on a copy
    n = x.shape[0]
    # Look up the BLAS axpy routine matching the dtype of the inputs.
    axpy = scipy.linalg.blas.get_blas_funcs('axpy', arrays=(x, yy))
    axpy(x, yy, n, b)  # yy <- b * x + yy
    return yy

perfplot.show(
    setup=lambda n: np.random.rand(2, n),
    kernels=[intrinsics, axpy],
    n_range=[2**k for k in range(25)],
    logx=True,
    logy=True,
)

So, never mind, I guess. :smile:

rgommers commented 4 years ago

All the arguments apply for numpy.dot as well though where we do use BLAS (and parallelization)

It's more like "we don't have control over that" than that we intended it to work that way. If we could make the whole OpenMP mess that powers it go away, I'd be all for that.

So, never mind, I guess. 😄

Thanks for sharing those results! Closing this issue then.

chillenb commented 5 months ago

The benchmark above is using arrays that are not large enough to overcome the overhead in perfplot. You can see the performance difference if you use many inner loops.

import numpy as np
import scipy.linalg
import perfplot
from threadpoolctl import threadpool_limits

b = 0.01

nloop = 500

def intrinsics(data):
    x, y = data
    for _ in range(nloop):
        y = y + b * x
    return y

def axpy(data):
    x, y = data
    n = x.shape[0]
    axpy = scipy.linalg.blas.daxpy
    for _ in range(nloop):
        axpy(x, y, n, b)  # y <- b * x + y, updated in place each iteration
    return y

with threadpool_limits(limits=8, user_api='blas'):
    perfplot.save(
        "daxpybench.png",
        setup=lambda n: ((np.random.random(n), np.random.random(n)),),
        kernels=[intrinsics, axpy],
        n_range=[2**k for k in range(10, 25)],
        target_time_per_measurement=2.0,
        logx=True,
        logy=True,
        equality_check=None,  # skip result comparison; the axpy kernel updates y in place across repeats
    )

[plot: daxpybench.png, timings of axpy vs. intrinsics with up to 8 BLAS threads]

This was taken on a laptop with an 8-core Zen 2 CPU with AVX2 (OpenBLAS). The gap will be wider on a server, especially if you have an Intel CPU + MKL. You cannot saturate the memory bandwidth of most many-core processors with a single thread anymore.
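If you want to check what your own install is doing, threadpoolctl can report which BLAS library NumPy/SciPy have loaded and how many threads each pool is configured to use; a small sketch:

from threadpoolctl import threadpool_info

# Lists the BLAS/OpenMP libraries currently loaded, with the number of
# threads each thread pool is configured to use.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"], pool.get("version"))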

rgommers commented 5 months ago

Thanks for improving the benchmark @chillenb. You are completely right. In general, fused operations should be faster under the right circumstances.