jbusecke opened this issue 3 years ago
Have you tried @njit(nogil=True)? Also, I'd be curious how @njit(parallel=True) and @njit(parallel=False) compare under plain serial %timeit.
Have you tried @njit(nogil=True)?
This was probably the trick I was looking for! With @njit(nogil=True) on all the functions above, I get:
2.77 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.33X speedup across 4 threads
5.14 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.01 s
For a 1.49X speedup across 4 threads
7.11 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.03 s
Total parallel time: 0.01 s
For a 3.35X speedup across 4 threads
Adding parallel=True just worked with the stencil functions. For regular_laplacian_numba_jit, I replaced the outer j loop with prange and got the timings below.
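As a rough illustration, the prange change looks something like this (a sketch only, not the exact function behind these timings):

import numpy as np
from numba import njit, prange

@njit(nogil=True, parallel=True)
def regular_laplacian_numba_jit_parallel(a):
    # same stencil as the serial version; only the outer loop changes
    ny, nx = a.shape
    out = np.empty_like(a)
    for j in prange(ny):  # outer loop split across numba's threadpool
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[(j - 1) % ny, i]
                + a[(j + 1) % ny, i]
                + a[j, (i - 1) % nx]
                + a[j, (i + 1) % nx]
            )
    return out

The timings with prange: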
1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 0.90X speedup across 4 threads
4.46 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.02 s
For a 0.85X speedup across 4 threads
4.01 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.28X speedup across 4 threads
So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.
Any advice on where our parallelism would best be spent? We could either be using dask or numba to achieve single-machine parallel scaling. Using both would probably not be the right choice. Is there a best practice here?
Whatever works :)
This was probably the trick I was looking for!
Great to confirm our theory that it was indeed the GIL. "3.35X speedup across 4 threads" sounds like the performance we were hoping for.
To note for the future: if you don't want to deal with perf to run giltracer, I like running py-spy top -- python my_script.py and watching the GIL vs Active percentages as a quick way to get a sense of how much impact the GIL is having. For example, I modified your script from https://github.com/ocean-eddy-cpt/gcm-filters/issues/45#issuecomment-884330141 and replaced the %timeits with
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as exc:
    for i in range(5000):
        fs = [exc.submit(regular_laplacian_numba_jit, data) for _ in range(4)]
        concurrent.futures.wait(fs)
and ran sudo py-spy top -- python test.py. Without nogil=True I'd see something like GIL: 100.00%, Active: 107.00%, Threads: 5. With nogil=True I'd see GIL: 23.00%, Active: 314.00%, Threads: 5. The GIL: 100.00% gives a strong indication that the GIL is the problem (and indeed, it's still probably the thing keeping us from a 4x speedup across 4 threads), so knowing that, you might invest the time to do more detailed profiling with giltracer.
So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.
This is what we'd expect. I wasn't suggesting we should switch to parallel=True; I wanted to look specifically at the performance of Numba's lower-level parallelism vs the naive parallelism of running the same serial Numba operation in 4 threads at once. I was curious whether Numba could be smarter about parallelism (particularly memory access) in some way.
We could either be using dask or numba to achieve single-machine parallel scaling.
I've tested this locally on macOS on intel, and Numba's parallelism seems ~50% faster than naive parallelism. However, you should test on a setup representative of what you'll actually run this on. There are many small details that affect this.
Is there a best practice here?
When it doesn't make a performance difference, I'd recommend fewer knobs to turn. So letting dask be the sole concurrency layer is simpler to reason about than layering a dask threadpool and a Numba threadpool.
However, Numba may be faster, so the complexity is probably worth it.
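If you do end up layering the two, it's worth pinning the thread counts explicitly so the pools don't oversubscribe the machine. A minimal sketch (the 4 dask threads x 2 numba threads split is just an assumption for an 8-core box, not a recommendation):

import numba

# Cap numba's parallel=True threadpool so that, e.g., 4 dask threads each
# running 2 numba threads roughly matches the core count instead of
# multiplying against it.
numba.set_num_threads(2)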
First, a side note: I believe the above code doesn't actually handle boundary conditions correctly. For example, if you change shape to 1025, it will segfault on my machine: at the last row, a[j + 1, i] is an out-of-bounds index. I have a version below that correctly wraps around at the edges. What you want is basically a[(j - 1) % ny, i], but using a conditional is a bit faster in my tests, since % is a relatively expensive CPU operation.
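For instance, the two indexing styles for the j-1 neighbour look roughly like this (a sketch, assuming ny = a.shape[0]):

def up_modulo(a, j, i, ny):
    return a[(j - 1) % ny, i]  # concise, but % costs an integer division per access

def up_conditional(a, j, i, ny):
    return a[j - 1 if j != 0 else ny - 1, i]  # plain branch; a bit faster in the hot loop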
The results I initially got were that whichever of these cases I ran first was the fastest. Commenting the other two out and running one at a time:
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.58 sec, 873.3 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 4.02 sec, 993.8 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 2.11 sec, 1895.6 ops/sec
vs all at once:
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.55 sec, 879.5 ops/sec
Serial in 4 threads, one data copy: 5.42 sec, 738.2 ops/sec
Numba parallel: 5.11 sec, 782.8 ops/sec
Notice "Numba parallel" takes 2.5x longer when run last versus when run alone.
The order dependence made me think of caching/memory-access patterns, so I tried running under jemalloc:
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.95 sec, 446.9 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, one data copy: 8.20 sec, 487.8 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Numba parallel: 12.48 sec, 320.6 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.62 sec, 464.0 ops/sec
Serial in 4 threads, one data copy: 8.15 sec, 490.6 ops/sec
Numba parallel: 12.85 sec, 311.3 ops/sec
On macOS, jemalloc removes the order-dependence effect. However, it also makes naive parallelism 2x slower and numba parallelism 6x slower. This seems to confirm that something related to memory is very important here. In particular, I guessed that switching memory allocators affects memory-allocation performance more than memory-access performance (though allocators certainly could have different strategies for optimizing cache performance, etc.).
Thinking about memory allocation, I noticed we were generating a new output array for each call with np.zeros_like(a). Just switching to np.empty_like boosted performance 25%-50%.

Then I added an out= argument and re-used the same preallocated output array for every call. This gets naive parallelism to 2x our starting point, but interestingly didn't help Numba parallelism much compared to np.zeros_like. Most importantly, it mostly eliminated the order effect and the jemalloc slowdown.
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 2.30 sec, 1740.4 ops/sec
Serial in 4 threads, one data copy: 2.19 sec, 1829.7 ops/sec
Numba parallel: 1.34 sec, 2977.4 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.56 sec, 1561.3 ops/sec
Serial in 4 threads, one data copy: 2.18 sec, 1835.8 ops/sec
Numba parallel: 1.37 sec, 2929.7 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.48 sec, 1613.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 2.26 sec, 1768.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 1.34 sec, 2980.7 ops/sec
In the end, this tells us pretty much what we already guessed from looking at the code: the operation itself is computationally very simple, so memory bandwidth is the limiting factor. Anything you can do to reduce new memory allocations and reuse existing arrays will have the biggest performance wins.
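For reference, the reuse pattern is roughly this (a usage sketch, assuming regular_laplacian_numba_jit has been given an out= parameter as described above):

import numpy as np

data = np.random.rand(1024, 1024)
out = np.empty_like(data)                       # allocate the output buffer once, up front
for _ in range(1000):
    regular_laplacian_numba_jit(data, out=out)  # every call writes into the same buffer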
I have a version below that correctly wraps around at the edges.
This is exactly what I shared in https://github.com/ocean-eddy-cpt/gcm-filters/issues/45#issuecomment-884498140 😄
the operation itself is computationally very simple, so memory bandwidth is the limiting factor.
This is a really useful insight. It would be great to get some insight into numba best practices for avoiding unnecessary memory allocation. For example, what's the right way to provide an out= option for a numba function?
In a similar vein, when using the stencil code path, I had to manually pad the array in order to deal with the boundary conditions:
@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]
This involves a copy of the whole array. Is there a better way?
This is exactly what I shared in #45 (comment) 😄
Aha! Sorry I missed that. I might suggest using conditionals like a[j - 1 if j != 0 else last_j, i] instead of (j - 1) % ny; IIRC it was a tiny bit faster. You should test though; that may also not be true.
I had to manually pad the array in order to deal with the boundary conditions...This involves a copy of the whole array. Is there a better way?
I think the better way is exactly what both you and I did: writing the loop explicitly in regular_laplacian_numba_jit. Stencil seems like a handy but limited tool. If it doesn't meet your needs, you're better off writing the code yourself than trying to adjust the input data to match how stencil thinks about it.
In general, this is a reversal of mindset from NumPy/Python. With NumPy, we are constantly altering the data to make it work with the functions we have available to us so we don't have to write for-loops. With Numba, we want to alter the data as little as possible, and write exactly the function we need to handle it as is. Any time you can use conditionals/logic to solve a problem instead of memory, it'll pretty much always be faster.
Posting an update with some results from casper with 36 cores.
import numpy as np
from numba import njit
from scipy.ndimage import uniform_filter


def numpy_laplacian(field):
    return (
        -4 * field
        + np.roll(field, -1, axis=-1)
        + np.roll(field, 1, axis=-1)
        + np.roll(field, -1, axis=-2)
        + np.roll(field, 1, axis=-2)
    )


@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


shape = (2048, 2048)
data = np.random.rand(*shape)
regular_laplacian_numba_jit_serial(data)  # warm-up call to trigger JIT compilation

print("numpy laplacian")
%timeit -n10 numpy_laplacian(data)
%ptime -n36 numpy_laplacian(data)

print("ndimage.uniform_filter")
%timeit -n10 uniform_filter(data)
%ptime -n36 uniform_filter(data)

print("numba laplacian")
%timeit -n10 regular_laplacian_numba_jit_serial(data)
%ptime -n36 regular_laplacian_numba_jit_serial(data)
Results:
numpy laplacian
74 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 2.67 s
Total parallel time: 0.38 s
For a 6.94X speedup across 36 threads
ndimage.uniform_filter
106 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 3.80 s
Total parallel time: 0.29 s
For a 13.01X speedup across 36 threads
numba laplacian
15.6 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.57 s
Total parallel time: 0.05 s
For a 10.44X speedup across 36 threads
So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage.
What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.
Thanks for the update, @rabernat!
So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage. What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.
To me, this performance improvement makes refactoring seem worthwhile. What do you think?
The complicating factor is GPU support. If we go with numba, supporting GPUs becomes more complicated (but not impossible). Whereas now, we basically get cupy-based GPU support for free.
It would be good to assess how difficult it would be to make the numba-cuda implementation of the kernels. I don't really know where to start with that.
Did you happen to see the numba-cuda example at the end of https://examples.dask.org/applications/stencils-with-numba.html#GPU-Version ?
Thanks for the reminder about that Matt, very helpful.
Our current implementation requires only one function for each Laplacian. Those functions automatically select either cupy or numpy based on the input data.
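The general shape of that dispatch is something like this (a generic sketch for illustration only, not the actual gcm-filters code):

import numpy as np

def laplacian(field):
    # Pick the array module that matches the input: cupy for GPU arrays,
    # numpy otherwise (hypothetical sketch).
    xp = np
    try:
        import cupy
        if isinstance(field, cupy.ndarray):
            xp = cupy
    except ImportError:
        pass
    return (
        -4 * field
        + xp.roll(field, -1, axis=-1)
        + xp.roll(field, 1, axis=-1)
        + xp.roll(field, -1, axis=-2)
        + xp.roll(field, 1, axis=-2)
    )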
The Dask example brings to mind several questions:

1. Would we have to write a separate version of each function for CPU and GPU? We need to support both.
2. How do we choose threadsperblock and blockspergrid?
3. Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid).
4. I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.
Would we have to write a separate version of each function for CPU and GPU? We need to support both.
Today yes, probably. GPUs and CPUs are different enough that different programming techniques are often needed. The counter-example is if you're doing something very simple, like something entirely vectorizable such that you can use numba.vectorize. I don't think that you're in this regime though.
How do we choose threadsperblock and blockspergrid?
That's a great question, and unfortunately I don't think there's a good general answer today. cc'ing @gmarkall from the numba/rapids team. I don't think he'll have a general-purpose answer for you, but we've spoken a bit about this before and I think he'll appreciate seeing the context of where this comes up.
Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid)
You can use if statements in numba.cuda, which is maybe how you would handle the boundary.
if i < 5:
    out[i, j] = 0
I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.
Outside of things like numba.vectorize or guvectorize I don't think that this is really possible with the CUDA programming model. I would be happy to be wrong here though.
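To illustrate the vectorize counter-example (a sketch; the Laplacian stencil itself doesn't fit this elementwise model):

import numpy as np
from numba import vectorize

# The same elementwise body can be compiled for different targets just by
# changing `target` ("cpu", "parallel", or "cuda" on a machine with a GPU).
@vectorize(["float64(float64, float64)"], target="cpu")
def scaled_sum(x, y):
    return 0.5 * (x + y)

scaled_sum(np.ones(4), np.ones(4))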
You can use if statements in numba.cuda, which is maybe how you would handle the boundary.
What we need is more like this:
out[j, i] = (
    -4 * a[j, i]
    + a[j - 1 if j != 0 else last_j, i]
    + a[j + 1 if j != last_j else 0, i]
    + a[j, i + 1 if i != last_i else 0]
    + a[j, i - 1 if i != 0 else last_i]
)
...where the i, j indexes are global. My concern is that the cuda kernel only has access to a local block (comparable to an MPI rank for the Fortran folks) and therefore can't trivially do wrap boundary conditions. I'm sure it's possible to do, just more complicated.
My CUDA is rusty enough that I no longer trust myself to talk about its memory model, but I think that it would be worth verifying the assumption about lack of global access.
@mrocklin is right -- each thread does have access to all of global memory. That being said, stencil kernels often benefit from the use of block-local "shared" memory (which numba exposes).
This article covers how to use shared memory in a very similar use-case to the above (3-d finite difference with periodic boundary conditions): https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/
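Since every thread can read any element of a global-memory array, wrap boundaries can use the same conditional indexing as the CPU version. A minimal, untuned sketch (no shared-memory tiling; requires a CUDA-capable GPU, and the launch parameters below are just common starting points, not recommendations):

import numpy as np
from numba import cuda

@cuda.jit
def laplacian_wrap_kernel(a, out):
    # cuda.grid(2) gives this thread's global (row, col) position
    j, i = cuda.grid(2)
    ny, nx = a.shape
    if j < ny and i < nx:
        jm = j - 1 if j != 0 else ny - 1
        jp = j + 1 if j != ny - 1 else 0
        im = i - 1 if i != 0 else nx - 1
        ip = i + 1 if i != nx - 1 else 0
        out[j, i] = -4.0 * a[j, i] + a[jm, i] + a[jp, i] + a[j, im] + a[j, ip]

data = np.random.rand(2048, 2048)
d_a = cuda.to_device(data)
d_out = cuda.device_array_like(d_a)
threadsperblock = (16, 16)  # a common default; worth tuning per GPU
blockspergrid = (
    (data.shape[0] + threadsperblock[0] - 1) // threadsperblock[0],
    (data.shape[1] + threadsperblock[1] - 1) // threadsperblock[1],
)
laplacian_wrap_kernel[blockspergrid, threadsperblock](d_a, d_out)
result = d_out.copy_to_host()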
FWIW, I think we do not have to resolve this before submitting the JOSS paper. We can leave this issue open and continue to work on performance.
The numba documentation just got expanded with a bunch of more detail on writing CUDA kernels: https://numba.readthedocs.io/en/latest/cuda/examples.html
This should be very useful if we choose to go that route.
JOSS asks for:
If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
What should/can we do to document performance of the package?