jbusecke opened this issue 3 years ago
Have you tried @njit(nogil=True)? Also, I'd be curious how @njit(parallel=True) and @njit(parallel=False) compare under plain serial %timeit.
Have you tried @njit(nogil=True)?
This was probably the trick I was looking for! With @njit(nogil=True) on all the functions above, I get:
2.77 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.33X speedup across 4 threads
5.14 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.01 s
For a 1.49X speedup across 4 threads
7.11 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.03 s
Total parallel time: 0.01 s
For a 3.35X speedup across 4 threads
Adding parallel=True just worked with the stencil functions. For regular_laplacian_numba_jit, I replaced the outer j loop with prange and got the timings below.
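As a rough illustration, the prange change looks something like this (a sketch only, not the exact function behind these timings):

import numpy as np
from numba import njit, prange

@njit(nogil=True, parallel=True)
def regular_laplacian_numba_jit_parallel(a):
    # same stencil as the serial version; only the outer loop changes
    ny, nx = a.shape
    out = np.empty_like(a)
    for j in prange(ny):  # outer loop split across numba's threadpool
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[(j - 1) % ny, i]
                + a[(j + 1) % ny, i]
                + a[j, (i - 1) % nx]
                + a[j, (i + 1) % nx]
            )
    return out

The timings with prange: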
1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 0.90X speedup across 4 threads
4.46 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.02 s
For a 0.85X speedup across 4 threads
4.01 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.28X speedup across 4 threads
So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.
Any advice on where our parallelism would best be spent? We could either be using dask or numba to achieve single-machine parallel scaling. Using both would probably not be the right choice. Is there a best practice here?
Whatever works :)
This was probably the trick I was looking for!
Great to confirm our theory that it was indeed the GIL. "3.35X speedup across 4 threads" sounds like the performance we were hoping for.
To note for the future: if you don't want to deal with perf to run giltracer, I like running py-spy top -- python my_script.py and watching the GIL vs Active percentages as a quick way to get a sense of how much impact the GIL is having. For example, I modified your script from https://github.com/ocean-eddy-cpt/gcm-filters/issues/45#issuecomment-884330141 and replaced the %timeits with
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as exc:
    for i in range(5000):
        fs = [exc.submit(regular_laplacian_numba_jit, data) for _ in range(4)]
        concurrent.futures.wait(fs)
and ran sudo py-spy top -- python test.py. Without nogil=True I'd see something like GIL: 100.00%, Active: 107.00%, Threads: 5. With nogil=True I'd see GIL: 23.00%, Active: 314.00%, Threads: 5. The GIL: 100.00% gives a strong indication that the GIL is the problem (and indeed, it's still probably the thing keeping us from a 4x speedup across 4 threads), so knowing that, you might invest the time to do more detailed profiling with giltracer.
So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.
This is what we'd expect. I wasn't suggesting we should switch to parallel=True; I wanted to look specifically at the performance of Numba's lower-level parallelism vs the naive parallelism of running the same serial Numba operation in 4 threads at once. I was curious whether Numba could be smarter about parallelism (particularly memory access) in some way.
We could either be using dask or numba to achieve single-machine parallel scaling.
I've tested this locally on macOS on intel, and Numba's parallelism seems ~50% faster than naive parallelism. However, you should test on a setup representative of what you'll actually run this on. There are many small details that affect this.
Is there a best practice here?
When it doesn't make a performance difference, I'd recommend fewer knobs to turn. So letting dask be the sole concurrency layer is simpler to reason about than layering a dask threadpool and a Numba threadpool.
However, Numba may be faster, so the complexity is probably worth it.
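If you do end up layering the two, it's worth pinning the thread counts explicitly so the pools don't oversubscribe the machine. A minimal sketch (the 4 dask threads x 2 numba threads split is just an assumption for an 8-core box, not a recommendation):

import numba

# Cap numba's parallel=True threadpool so that, e.g., 4 dask threads each
# running 2 numba threads roughly matches the core count instead of
# multiplying against it.
numba.set_num_threads(2)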
First, a side note: I believe the above code doesn't actually handle boundary conditions correctly. For example, if you change shape to 1025, it will segfault on my machine: at the last row, a[j + 1, i] is an out-of-bounds index. I have a version below that correctly wraps around at the edges. What you want is basically a[(j - 1) % ny, i], but using a conditional is a bit faster in my tests, since % is a relatively expensive CPU operation.
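For instance, the two indexing styles for the j-1 neighbour look roughly like this (a sketch, assuming ny = a.shape[0]):

def up_modulo(a, j, i, ny):
    return a[(j - 1) % ny, i]  # concise, but % costs an integer division per access

def up_conditional(a, j, i, ny):
    return a[j - 1 if j != 0 else ny - 1, i]  # plain branch; a bit faster in the hot loop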
The results I initially got were that whichever of these cases I ran first was the fastest. Commenting the other two out and running one at a time:
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.58 sec, 873.3 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 4.02 sec, 993.8 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 2.11 sec, 1895.6 ops/sec
vs all at once:
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.55 sec, 879.5 ops/sec
Serial in 4 threads, one data copy: 5.42 sec, 738.2 ops/sec
Numba parallel: 5.11 sec, 782.8 ops/sec
Notice "Numba parallel" takes 2.5x longer when run last versus when run alone.
The order dependence made me think of caching/memory-access patterns, so I tried running under jemalloc:
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.95 sec, 446.9 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, one data copy: 8.20 sec, 487.8 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Numba parallel: 12.48 sec, 320.6 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.62 sec, 464.0 ops/sec
Serial in 4 threads, one data copy: 8.15 sec, 490.6 ops/sec
Numba parallel: 12.85 sec, 311.3 ops/sec
On macOS, jemalloc removes the order-dependence effect. However, it also makes naive parallelism 2x slower and numba parallelism 6x slower. This seems to confirm that something related to memory is very important here. In particular, I guessed that switching memory allocators affects memory-allocation performance more than memory-access performance (though allocators certainly could have different strategies for optimizing cache performance, etc.).
Thinking about memory allocation, I noticed we were generating a new output array for each call with np.zeros_like(a). Just switching to np.empty_like boosted performance 25%-50%.

Then I added an out= argument and re-used the same preallocated output array for every call. This gets naive parallelism to 2x our starting point, but interestingly didn't help Numba parallelism much compared to np.zeros_like. Most importantly, it mostly eliminated the order effect and the jemalloc slowdown.
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 2.30 sec, 1740.4 ops/sec
Serial in 4 threads, one data copy: 2.19 sec, 1829.7 ops/sec
Numba parallel: 1.34 sec, 2977.4 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.56 sec, 1561.3 ops/sec
Serial in 4 threads, one data copy: 2.18 sec, 1835.8 ops/sec
Numba parallel: 1.37 sec, 2929.7 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.48 sec, 1613.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 2.26 sec, 1768.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 1.34 sec, 2980.7 ops/sec
In the end, this tells us pretty much what we already guessed from looking at the code: the operation itself is computationally very simple, so memory bandwidth is the limiting factor. Anything you can do to reduce new memory allocations and reuse existing arrays will have the biggest performance wins.
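For reference, the reuse pattern is roughly this (a usage sketch, assuming regular_laplacian_numba_jit has been given an out= parameter as described above):

import numpy as np

data = np.random.rand(1024, 1024)
out = np.empty_like(data)                       # allocate the output buffer once, up front
for _ in range(1000):
    regular_laplacian_numba_jit(data, out=out)  # every call writes into the same buffer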
I have a version below that correctly wraps around at the edges.
This is exactly what I shared in https://github.com/ocean-eddy-cpt/gcm-filters/issues/45#issuecomment-884498140 😄
the operation itself is computationally very simple, so memory bandwidth is the limiting factor.
This is a really useful insight. It would be great to get some insight into numba best practices for avoiding unnecessary memory allocation. For example, what's the right way to provide an out= option for a numba function?
In a similar vein, when using the stencil code path, I had to manually pad the array in order to deal with the boundary conditions:
@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]
This involves a copy of the whole array. Is there a better way?
This is exactly what I shared in #45 (comment) 😄
Aha! Sorry I missed that. I might suggest using conditionals like a[j - 1 if j != 0 else last_j, i] instead of (j - 1) % ny; IIRC it was a tiny bit faster. You should test though; that may also not be true.
I had to manually pad the array in order to deal with the boundary conditions...This involves a copy of the whole array. Is there a better way?
I think the better way is exactly what both you and I did: writing the loop explicitly in regular_laplacian_numba_jit. Stencil seems like a handy but limited tool. If it doesn't meet your needs, you're better off writing the code yourself than trying to adjust the input data to match how stencil thinks about it.
In general, this is a reversal of mindset from NumPy/Python. With NumPy, we are constantly altering the data to make it work with the functions we have available to us so we don't have to write for-loops. With Numba, we want to alter the data as little as possible, and write exactly the function we need to handle it as is. Any time you can use conditionals/logic to solve a problem instead of memory, it'll pretty much always be faster.
Posting an update with some results from casper with 36 cores.
import numpy as np
from numba import njit
from scipy.ndimage import uniform_filter


def numpy_laplacian(field):
    return (
        -4 * field
        + np.roll(field, -1, axis=-1)
        + np.roll(field, 1, axis=-1)
        + np.roll(field, -1, axis=-2)
        + np.roll(field, 1, axis=-2)
    )


@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


shape = (2048, 2048)
data = np.random.rand(*shape)
regular_laplacian_numba_jit_serial(data)  # warm-up call to trigger JIT compilation

print("numpy laplacian")
%timeit -n10 numpy_laplacian(data)
%ptime -n36 numpy_laplacian(data)

print("ndimage.uniform_filter")
%timeit -n10 uniform_filter(data)
%ptime -n36 uniform_filter(data)

print("numba laplacian")
%timeit -n10 regular_laplacian_numba_jit_serial(data)
%ptime -n36 regular_laplacian_numba_jit_serial(data)
Results:
numpy laplacian
74 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 2.67 s
Total parallel time: 0.38 s
For a 6.94X speedup across 36 threads
ndimage.uniform_filter
106 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 3.80 s
Total parallel time: 0.29 s
For a 13.01X speedup across 36 threads
numba laplacian
15.6 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.57 s
Total parallel time: 0.05 s
For a 10.44X speedup across 36 threads
So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage.
What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.
Thanks for the update, @rabernat!
So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage. What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.
To me, this performance improvement makes refactoring seem worthwhile. What do you think?
The complicating factor is GPU support. If we go with numba, supporting GPUs becomes more complicated (but not impossible). Whereas now, we basically get cupy-based GPU support for free.
It would be good to assess how difficult it would be to make the numba-cuda implementation of the kernels. I don't really know where to start with that.
Did you happen to see the numba-cuda example at the end of https://examples.dask.org/applications/stencils-with-numba.html#GPU-Version ?
Thanks for the reminder about that Matt, very helpful.
Our current implementation requires only one function for each Laplacian. Those functions automatically select either cupy or numpy based on the input data.
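The general shape of that dispatch is something like this (a generic sketch for illustration only, not the actual gcm-filters code):

import numpy as np

def laplacian(field):
    # Pick the array module that matches the input: cupy for GPU arrays,
    # numpy otherwise (hypothetical sketch).
    xp = np
    try:
        import cupy
        if isinstance(field, cupy.ndarray):
            xp = cupy
    except ImportError:
        pass
    return (
        -4 * field
        + xp.roll(field, -1, axis=-1)
        + xp.roll(field, 1, axis=-1)
        + xp.roll(field, -1, axis=-2)
        + xp.roll(field, 1, axis=-2)
    )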
The Dask example brings to mind several questions:

1. Would we have to write a separate version of each function for CPU and GPU? We need to support both.
2. How do we choose threadsperblock and blockspergrid?
3. Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid).
4. I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.
Would we have to write a separate version of each function for CPU and GPU? We need to support both.
Today yes, probably. GPUs and CPUs are different enough that different programming techniques are often needed. The counter-example is if you're doing something very simple, like something entirely vectorizable such that you can use numba.vectorize. I don't think that you're in this regime though.
How do we choose threadsperblock and blockspergrid?
That's a great question, and unfortunately I don't think there's a good general answer today. cc'ing @gmarkall from the numba/rapids team. I don't think he'll have a general-purpose answer for you, but we've spoken a bit about this before and I think he'll appreciate seeing the context of where this comes up.
Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid)
You can use if statements in numba.cuda, which is maybe how you would handle the boundary.
if i < 5:
    out[i, j] = 0
I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.
Outside of things like numba.vectorize or guvectorize I don't think that this is really possible with the CUDA programming model. I would be happy to be wrong here though.
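To illustrate the vectorize counter-example (a sketch; the Laplacian stencil itself doesn't fit this elementwise model):

import numpy as np
from numba import vectorize

# The same elementwise body can be compiled for different targets just by
# changing `target` ("cpu", "parallel", or "cuda" on a machine with a GPU).
@vectorize(["float64(float64, float64)"], target="cpu")
def scaled_sum(x, y):
    return 0.5 * (x + y)

scaled_sum(np.ones(4), np.ones(4))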
You can use if statements in numba.cuda, which is maybe how you would handle the boundary.
What we need is more like this:
out[j, i] = (
    -4 * a[j, i]
    + a[j - 1 if j != 0 else last_j, i]
    + a[j + 1 if j != last_j else 0, i]
    + a[j, i + 1 if i != last_i else 0]
    + a[j, i - 1 if i != 0 else last_i]
)
...where the i, j indexes are global. My concern is that the cuda kernel only has access to a local block (comparable to an MPI rank for the Fortran folks) and therefore can't trivially do wrap boundary conditions. I'm sure it's possible to do, just more complicated.
My CUDA is rusty enough that I no longer trust myself to talk about its memory model, but I think that it would be worth verifying the assumption about lack of global access.
@mrocklin is right -- each thread does have access to all of global memory. That being said, stencil kernels often benefit from the use of block-local "shared" memory (which numba exposes).
This article covers how to use shared memory in a very similar use-case to the above (3-d finite difference with periodic boundary conditions): https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/
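Since every thread can read any element of a global-memory array, wrap boundaries can use the same conditional indexing as the CPU version. A minimal, untuned sketch (no shared-memory tiling; requires a CUDA-capable GPU, and the launch parameters below are just common starting points, not recommendations):

import numpy as np
from numba import cuda

@cuda.jit
def laplacian_wrap_kernel(a, out):
    # cuda.grid(2) gives this thread's global (row, col) position
    j, i = cuda.grid(2)
    ny, nx = a.shape
    if j < ny and i < nx:
        jm = j - 1 if j != 0 else ny - 1
        jp = j + 1 if j != ny - 1 else 0
        im = i - 1 if i != 0 else nx - 1
        ip = i + 1 if i != nx - 1 else 0
        out[j, i] = -4.0 * a[j, i] + a[jm, i] + a[jp, i] + a[j, im] + a[j, ip]

data = np.random.rand(2048, 2048)
d_a = cuda.to_device(data)
d_out = cuda.device_array_like(d_a)
threadsperblock = (16, 16)  # a common default; worth tuning per GPU
blockspergrid = (
    (data.shape[0] + threadsperblock[0] - 1) // threadsperblock[0],
    (data.shape[1] + threadsperblock[1] - 1) // threadsperblock[1],
)
laplacian_wrap_kernel[blockspergrid, threadsperblock](d_a, d_out)
result = d_out.copy_to_host()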
FWIW, I think we do not have to resolve this before submitting the JOSS paper. We can leave this issue open and continue to work on performance.
The numba documentation just got expanded with a bunch of more detail on writing CUDA kernels: https://numba.readthedocs.io/en/latest/cuda/examples.html
This should be very useful if we choose to go that route.
JOSS asks for:
If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
What should/can we do to document performance of the package?