Open pwuertz opened 2 months ago
Update: When moving all memory allocations from Numba to NumPy, the total runtime (NumPy allocations + Numba call without allocations) is on par with NumPy again.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, nogil=True)
def func_without_alloc(z1, z2, out1, out2):
    for i in range(z1.size):
        out1[i] = z1[i] + z2[i]
    out2[:] = out1
    return out1, out2

def func_alloc_wrapper(z1, z2):
    # Allocate the output buffers with NumPy, outside the jitted function
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    return func_without_alloc(z1, z2, out1, out2)
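Called, for instance, like this (array sizes chosen to match the benchmark further down in the thread; illustrative only):

# Illustrative inputs; sizes match the benchmark used later in this thread
z1 = np.random.default_rng(42).random(4 * 1024 * 1024, dtype=np.float32)
z2 = np.random.default_rng(43).random(4 * 1024 * 1024, dtype=np.float32)

# NumPy performs the allocations, the Numba kernel only fills the buffers
out1, out2 = func_alloc_wrapper(z1, z2)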
I was also a bit surprised that Numba doesn't elide the creation of a temporary array in `out1[:] = z1 + z2`. Only a manual iteration like `for i in range(z1.size): out1[i] = z1[i] + z2[i]` prevents this.
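For illustration, the two patterns side by side (a sketch; in nopython mode only the second avoids the intermediate allocation):

import numba as nb

@nb.njit
def assign_via_temporary(z1, z2, out1):
    # z1 + z2 materialises a temporary array, which is then copied into out1
    out1[:] = z1 + z2

@nb.njit
def assign_via_loop(z1, z2, out1):
    # the explicit loop writes directly into out1; no temporary is created
    for i in range(z1.size):
        out1[i] = z1[i] + z2[i]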
Thanks for the report. A similar effect to the one in the OP (a somewhat random bimodal distribution of performance) was reproducible on Linux for the NumPy run but not for Numba; the magnitude of the difference was about 2x. Trying the same on OSX with Apple silicon showed behaviour similar to the OP, but the "peaks" in the distribution were slower for NumPy and the magnitude of the difference was about 30%. I'm suspicious that this is hardware/low-level OS related, perhaps to do with the memory system. An observation that may support this is that on Linux, `mallinfo(3)` reports no change in the `arena` size in the case of a "slow" run, and a positive change in the `arena` size in the "fast" run.
RE temporary arrays: Numba has some optimisations that can remove temporary arrays across terms in a ufunc-like expression, and the `parallel=True` target also has a number of more involved optimisations. However, as noted, a good way to guarantee the allocation behaviour is to control it via explicit allocation calls and then write explicit loops acting on that memory.
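A sketch of what that combination might look like with the `parallel=True` target (assuming the same elementwise-add workload; `prange` distributes the explicit loop across threads):

import numpy as np
import numba as nb

@nb.njit(parallel=True, nogil=True)
def add_into(z1, z2, out):
    # explicit loop over a preallocated buffer; no temporaries, no allocation
    for i in nb.prange(z1.size):
        out[i] = z1[i] + z2[i]

z1 = np.random.rand(4 * 1024 * 1024).astype(np.float32)
z2 = np.random.rand(4 * 1024 * 1024).astype(np.float32)
out = np.empty_like(z1)  # allocation controlled by the caller
add_into(z1, z2, out)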
Thanks for your insights!
I'm still planning on following up on your `arena` size idea and seeing whether timings correlate the same way on my end (unfortunately I won't be able to get to it until some time next month due to moving house). If I can, I'll try to consolidate this for different OS and/or CPU types as well.
I think I can confirm the correlation between fast/slow runs and specific changes in the memory arena.
This is the setup for calling `mallinfo2` from libc:
from cffi import FFI

ffi = FFI()
ffi.cdef("""
struct mallinfo2 {
    size_t arena;     /* Non-mmapped space allocated (bytes) */
    size_t ordblks;   /* Number of free chunks */
    size_t smblks;    /* Number of free fastbin blocks */
    size_t hblks;     /* Number of mmapped regions */
    size_t hblkhd;    /* Space allocated in mmapped regions (bytes) */
    size_t usmblks;   /* See below */
    size_t fsmblks;   /* Space in freed fastbin blocks (bytes) */
    size_t uordblks;  /* Total allocated space (bytes) */
    size_t fordblks;  /* Total free space (bytes) */
    size_t keepcost;  /* Top-most, releasable space (bytes) */
};
struct mallinfo2 mallinfo2(void);
""")
libc = ffi.dlopen("libc.so.6")
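With that in place, the current `arena` value (non-mmapped heap size, in bytes) can be read directly, e.g.:

arena_bytes = libc.mallinfo2().arena  # non-mmapped heap size in bytes
print(f"arena: {arena_bytes * 1e-6:.0f} MB")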
Here is a more minimal example, with arena size checks included:
@nb.njit(nogil=True)
def test_fn(z1):
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    out1[:] = 1.0
    return out1, out2

for _ in range(8):
    arena1 = libc.mallinfo2().arena
    z1 = np.random.default_rng(seed=42).random(4 * 1024 * 1024, dtype=np.float32)
    z2 = np.random.default_rng(seed=43).random(4 * 1024 * 1024, dtype=np.float32)
    arena2 = libc.mallinfo2().arena
    # Switch between NumPy and Numba here ->
    # %timeit test_fn(z1)
    %timeit test_fn.py_func(z1)
    arena3 = libc.mallinfo2().arena
    arena_diff1_mb = round((arena2 - arena1) * 1e-6)
    arena_diff2_mb = round((arena3 - arena2) * 1e-6)
    print(f" (arena diffs: {arena_diff1_mb} MB, {arena_diff2_mb} MB)")
For NumPy (using `test_fn.py_func`):
236 μs ± 6.53 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.38 ms ± 29.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
228 μs ± 972 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: 17 MB, 17 MB)
229 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.36 ms ± 24.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
263 μs ± 529 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: 17 MB, 17 MB)
263 μs ± 623 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.42 ms ± 6.19 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
So I'm seeing mostly fast runs, except when the arena shrank by roughly two array sizes during the allocation of `z1` and `z2` (each array is 4 × 1024 × 1024 float32 values, i.e. about 17 MB, so two arrays are about 34 MB) and did not increase again after the timed runs finished.
For Numba (using `test_fn`):
5.79 ms ± 87.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: -34 MB, 0 MB)
5.95 ms ± 89.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 34 MB, 0 MB)
6.03 ms ± 32.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, -17 MB)
6.03 ms ± 23.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 17 MB, 0 MB)
6.02 ms ± 32.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, 0 MB)
6.04 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, -17 MB)
6.11 ms ± 140 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 17 MB, 0 MB)
5.98 ms ± 80.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, 0 MB)
I'm having trouble recreating fast runs with Numba at the moment. Right now, Numba is 4x to 20x slower than NumPy whenever it has to handle the memory allocations itself.
I'm seeing some strange behavior where the performance of my Numba function jumps between different "states" depending on some system state I can't seem to pin down.
This is my test function:
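A sketch reconstructed from the allocation-free variant in the update above, not the exact original; the key point is that the output arrays are allocated inside the jitted function:

import numpy as np
import numba as nb

@nb.njit(fastmath=True, nogil=True)
def func(z1, z2):
    # sketch: outputs allocated inside the jitted function
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    out1[:] = z1 + z2  # creates a temporary array, as noted in the update
    out2[:] = out1
    return out1, out2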
In Jupyter, I'm timing `func` multiple times with the same input data, but with newly created arrays for each `%timeit`: Numba performance randomly jumps between 2.5 ms and 15 ms, yet it stays consistent as long as I'm not recreating the input arrays. This was a very good run; most of the time it will just stick to the 15 ms state (in fact, I currently can't reproduce the 2.5 ms state anymore). The Python / NumPy function also seems to have a good and a bad state, but it is much less pronounced.
I really don't know where those 15 ms in Numba are coming from. Memory allocations shouldn't take this long, and Numba is evidently capable of executing the same compiled function in 2.5 ms.
Test environment: