Open pwuertz opened 2 months ago
Update: When moving all memory allocations from Numba to NumPy, the total runtime (NumPy allocations + Numba call without allocations) is on par with NumPy again.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, nogil=True)
def func_without_alloc(z1, z2, out1, out2):
    for i in range(z1.size):
        out1[i] = z1[i] + z2[i]
    out2[:] = out1
    return out1, out2

def func_alloc_wrapper(z1, z2):
    # Allocate the output buffers with NumPy, outside the jitted function
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    return func_without_alloc(z1, z2, out1, out2)
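Called, for instance, like this (array sizes chosen to match the benchmark further down in the thread; illustrative only):

# Illustrative inputs; sizes match the benchmark used later in this thread
z1 = np.random.default_rng(42).random(4 * 1024 * 1024, dtype=np.float32)
z2 = np.random.default_rng(43).random(4 * 1024 * 1024, dtype=np.float32)

# NumPy performs the allocations, the Numba kernel only fills the buffers
out1, out2 = func_alloc_wrapper(z1, z2)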
I was also a bit surprised that Numba doesn't elide the creation of a temporary array in `out1[:] = z1 + z2`. Only a manual iteration like `for i in range(z1.size): out1[i] = z1[i] + z2[i]` prevents this.
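For illustration, the two patterns side by side (a sketch; in nopython mode only the second avoids the intermediate allocation):

import numba as nb

@nb.njit
def assign_via_temporary(z1, z2, out1):
    # z1 + z2 materialises a temporary array, which is then copied into out1
    out1[:] = z1 + z2

@nb.njit
def assign_via_loop(z1, z2, out1):
    # the explicit loop writes directly into out1; no temporary is created
    for i in range(z1.size):
        out1[i] = z1[i] + z2[i]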
Thanks for the report. A similar effect to the one in the OP (a somewhat random bimodal distribution of performance) was reproducible on Linux for the NumPy run but not for Numba; the magnitude of the difference was about 2x. Trying the same on OSX with Apple silicon showed behaviour similar to the OP, but the "peaks" in the distribution were slower for NumPy and the magnitude of the difference was about 30%. I'm suspicious that this is hardware/low-level OS related, perhaps to do with the memory system. An observation that may support this is that on Linux, `mallinfo(3)` reports no change in the `arena` size in the case of a "slow" run, and a positive change in the `arena` size in the "fast" run.
RE temporary arrays: Numba has some optimisations that can remove temporary arrays across terms in a ufunc-like expression, and the `parallel=True` target also has a number of more involved optimisations. However, as noted, a good way to guarantee the allocation behaviour is to control it via explicit allocation calls and then write explicit loops acting on that memory.
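A sketch of what that combination might look like with the `parallel=True` target (assuming the same elementwise-add workload; `prange` distributes the explicit loop across threads):

import numpy as np
import numba as nb

@nb.njit(parallel=True, nogil=True)
def add_into(z1, z2, out):
    # explicit loop over a preallocated buffer; no temporaries, no allocation
    for i in nb.prange(z1.size):
        out[i] = z1[i] + z2[i]

z1 = np.random.rand(4 * 1024 * 1024).astype(np.float32)
z2 = np.random.rand(4 * 1024 * 1024).astype(np.float32)
out = np.empty_like(z1)  # allocation controlled by the caller
add_into(z1, z2, out)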
Thanks for your insights!
I'm still planning on following up on your `arena` size idea and seeing whether timings correlate the same way on my end (unfortunately I won't be able to get to it until some time next month due to moving house). If I can, I'll try to consolidate this for different OS and/or CPU types as well.
I think I can confirm the correlation between fast/slow runs and specific changes in the memory arena.
This is the setup for calling `mallinfo2` from libc:
from cffi import FFI

ffi = FFI()
ffi.cdef("""
struct mallinfo2 {
    size_t arena;     /* Non-mmapped space allocated (bytes) */
    size_t ordblks;   /* Number of free chunks */
    size_t smblks;    /* Number of free fastbin blocks */
    size_t hblks;     /* Number of mmapped regions */
    size_t hblkhd;    /* Space allocated in mmapped regions (bytes) */
    size_t usmblks;   /* See below */
    size_t fsmblks;   /* Space in freed fastbin blocks (bytes) */
    size_t uordblks;  /* Total allocated space (bytes) */
    size_t fordblks;  /* Total free space (bytes) */
    size_t keepcost;  /* Top-most, releasable space (bytes) */
};
struct mallinfo2 mallinfo2(void);
""")
libc = ffi.dlopen("libc.so.6")
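With that in place, the current `arena` value (non-mmapped heap size, in bytes) can be read directly, e.g.:

arena_bytes = libc.mallinfo2().arena  # non-mmapped heap size in bytes
print(f"arena: {arena_bytes * 1e-6:.0f} MB")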
Here is a more minimal example, with arena size checks included:
@nb.njit(nogil=True)
def test_fn(z1):
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    out1[:] = 1.0
    return out1, out2

for _ in range(8):
    arena1 = libc.mallinfo2().arena
    z1 = np.random.default_rng(seed=42).random(4 * 1024 * 1024, dtype=np.float32)
    z2 = np.random.default_rng(seed=43).random(4 * 1024 * 1024, dtype=np.float32)
    arena2 = libc.mallinfo2().arena
    # Switch between NumPy and Numba here ->
    # %timeit test_fn(z1)
    %timeit test_fn.py_func(z1)
    arena3 = libc.mallinfo2().arena
    arena_diff1_mb = round((arena2 - arena1) * 1e-6)
    arena_diff2_mb = round((arena3 - arena2) * 1e-6)
    print(f" (arena diffs: {arena_diff1_mb} MB, {arena_diff2_mb} MB)")
For NumPy (using `test_fn.py_func`):
236 μs ± 6.53 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.38 ms ± 29.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
228 μs ± 972 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: 17 MB, 17 MB)
229 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.36 ms ± 24.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
263 μs ± 529 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: 17 MB, 17 MB)
263 μs ± 623 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -17 MB, 17 MB)
1.42 ms ± 6.19 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(arena diffs: -34 MB, 0 MB)
So I'm seeing mostly fast runs, except when the arena shrank by roughly two array sizes during the allocation of `z1` and `z2` (each array is 4 × 1024 × 1024 float32 values, i.e. about 17 MB, so two arrays are about 34 MB) and did not increase again after the timed runs finished.
For Numba (using `test_fn`):
5.79 ms ± 87.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: -34 MB, 0 MB)
5.95 ms ± 89.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 34 MB, 0 MB)
6.03 ms ± 32.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, -17 MB)
6.03 ms ± 23.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 17 MB, 0 MB)
6.02 ms ± 32.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, 0 MB)
6.04 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, -17 MB)
6.11 ms ± 140 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 17 MB, 0 MB)
5.98 ms ± 80.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(arena diffs: 0 MB, 0 MB)
I'm having trouble recreating fast runs with Numba at the moment. Right now, Numba is 4x to 20x slower than NumPy whenever it has to handle the memory allocations itself.
I'm seeing some strange behavior where the performance of my Numba function jumps between different "states" depending on some system state I can't seem to pin down.
This is my test function:
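A sketch reconstructed from the allocation-free variant in the update above, not the exact original; the key point is that the output arrays are allocated inside the jitted function:

import numpy as np
import numba as nb

@nb.njit(fastmath=True, nogil=True)
def func(z1, z2):
    # sketch: outputs allocated inside the jitted function
    out1 = np.empty_like(z1)
    out2 = np.empty_like(z1)
    out1[:] = z1 + z2  # creates a temporary array, as noted in the update
    out2[:] = out1
    return out1, out2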
In Jupyter, I'm timing `func` multiple times with the same input data, but with newly created arrays for each `%timeit`: Numba performance randomly jumps between 2.5 ms and 15 ms, yet it stays consistent as long as I'm not recreating the input arrays. This was a very good run; most of the time it will just stick to the 15 ms state (in fact, I currently can't reproduce the 2.5 ms state anymore). The Python / NumPy function also seems to have a good and a bad state, but it is much less pronounced.
I really don't know where those 15 ms in Numba are coming from. Memory allocations shouldn't take this long, and Numba is evidently capable of executing the same compiled function in 2.5 ms.
Test environment: