solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Poor jemalloc performance with zeroed allocations leading to TLB shootdown #27275

Open alessandrod opened 2 years ago

alessandrod commented 2 years ago

Problem

While profiling a branch including all the patches needed to bring direct account mapping with abiv1, I noticed a very large amount of TLB flushes and page faults caused by the program runtime. Initially I feared that direct mapping changes were somehow causing the issue, but I've now observed that the problem can happen in master as well. Direct mapping does seem to make it worse, most likely by making the program runtime threads a lot faster (the irony!).

The problem is the following:

[Screenshot: profiler stack trace, captured 2022-08-19]

It looks like jemalloc always force-purges zeroed extents immediately, instead of using the two-phase release it applies to non-zeroed allocations. Two-phase cleanup reduces the overhead of allocating and deallocating memory, at the expense of retaining a bit more memory during the decay period. Furthermore, jemalloc purges zeroed extents with madvise(MADV_DONTNEED), which requires a TLB flush and, with our allocation sizes, a full TLB flush (the theory being that a full flush is faster than flushing the individual page entries).

Since we run the program runtime inside rayon, we have a bunch of threads constantly flushing TLBs, so we end up in a by-the-book TLB shootdown (https://web.njit.edu/~dingxn/papers/ispa20.pdf).

To confirm that the shootdown is caused by the interaction between the rayon thread pool and jemalloc (the default glibc allocator doesn't exhibit the problem), I've written a minimal test case which mimics the CallFrame allocation we do in the program runtime: https://gist.github.com/alessandrod/a80788429873a4b9caa6aa53a82e0b2b
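For readers who don't want to open the gist, here is a rough sketch of the allocation pattern being compared. It is not the gist's code: it uses std threads instead of rayon, and FRAME_BYTES, the function names and the iteration counts are made up for illustration. It only shows the calloc-style versus malloc+memset-style churn; with Rust's default system allocator it will not reproduce the slowdown, since that needs jemalloc installed as the global allocator.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};
use std::thread;

/// Hypothetical frame size; the real CallFrame size is not stated in this issue.
const FRAME_BYTES: usize = 64 * 1024;

/// calloc-style: ask the allocator for memory that is already zeroed.
fn alloc_calloc(layout: Layout) -> *mut u8 {
    // SAFETY: `layout` has a non-zero size.
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    ptr
}

/// malloc+memset-style: take whatever comes back and zero it by hand,
/// which pages in the whole allocation.
fn alloc_malloc_memset(layout: Layout) -> *mut u8 {
    // SAFETY: `layout` has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    // SAFETY: `ptr` is valid for `layout.size()` bytes.
    unsafe { ptr.write_bytes(0, layout.size()) };
    ptr
}

/// Allocates and frees `iters` frames on each of `threads` threads.
/// Returns how many frames were NOT zeroed at both ends (expected: 0).
fn churn(threads: usize, iters: usize, use_calloc: bool) -> usize {
    let layout = Layout::from_size_align(FRAME_BYTES, 16).expect("valid layout");
    let workers: Vec<thread::JoinHandle<usize>> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut nonzero = 0;
                for _ in 0..iters {
                    let ptr = if use_calloc {
                        alloc_calloc(layout)
                    } else {
                        alloc_malloc_memset(layout)
                    };
                    // SAFETY: `ptr` is valid for `layout.size()` bytes and
                    // is freed with the same layout it was allocated with.
                    unsafe {
                        if ptr.read() != 0 || ptr.add(layout.size() - 1).read() != 0 {
                            nonzero += 1;
                        }
                        // Dirty the frame so a reused block needs re-zeroing.
                        ptr.write(1);
                        dealloc(ptr, layout);
                    }
                }
                nonzero
            })
        })
        .collect();
    workers
        .into_iter()
        .map(|w| w.join().expect("worker panicked"))
        .sum()
}
```

Calling churn(64, 10_000, true) versus churn(64, 10_000, false) under a timer gives the calloc and malloc_memset cases; the thread and iteration counts here are arbitrary.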

Here are perf numbers on a 64 vCPU gcloud VM:

$ hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     122.5 ms ±  22.5 ms    [User: 1566.1 ms, System: 113.2 ms]
  Range (min … max):    59.3 ms … 176.6 ms    23 runs

Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     260.2 ms ±  28.7 ms    [User: 370.3 ms, System: 5734.7 ms]
  Range (min … max):   207.0 ms … 293.6 ms    10 runs

Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):      94.5 ms ±  10.0 ms    [User: 85.6 ms, System: 237.9 ms]
  Range (min … max):    64.8 ms … 123.1 ms    28 runs

Summary
  'target/release/examples/mem calloc_slab' ran
    1.30 ± 0.27 times faster than 'target/release/examples/mem malloc_memset'
    2.75 ± 0.42 times faster than 'target/release/examples/mem calloc'

You can see that calloc is much slower than malloc_memset (about 2.1x in mean wall time, with system time jumping from roughly 113 ms to 5.7 s), even though the latter causes nearly twice as many page faults as it pages in the whole allocation to zero it.

calloc_slab works around the problem by pre-allocating large zero extents and then purging in one go, therefore doing only one TLB flush when the whole slab is deallocated. This confirms that the problem is caused by releasing many small calloc allocations. I've prototyped this for the program runtime, with one slab per transaction execution. Unfortunately, since we don't have a hard max number of instructions that can be executed per transaction, the slab needs to be quite large, and while it improves perf, it also increases peak virtual memory usage significantly (although actual paged-in memory stays lower than with malloc_memset).
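To make the slab idea concrete, here is a minimal sketch, assuming a simple bump allocator over one zeroed buffer. ZeroSlab and its methods are hypothetical names, not the prototype mentioned above and not the gist's calloc_slab. Frames are handed out as byte ranges so that several nested frames can be live at once, and the whole buffer goes back to the allocator in a single free when the slab is dropped.

```rust
use std::ops::Range;

/// Hypothetical per-transaction slab: one zeroed allocation, carved into
/// frames by bumping an offset, released all at once on drop.
struct ZeroSlab {
    buf: Vec<u8>,
    used: usize,
}

impl ZeroSlab {
    /// `vec![0u8; n]` requests zeroed memory from the allocator up front.
    fn new(capacity: usize) -> Self {
        ZeroSlab { buf: vec![0u8; capacity], used: 0 }
    }

    /// Reserves `len` zeroed bytes, or returns None if the slab is exhausted.
    fn alloc_frame(&mut self, len: usize) -> Option<Range<usize>> {
        let end = self.used.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let range = self.used..end;
        self.used = end;
        Some(range)
    }

    /// Borrows the bytes of a previously reserved frame.
    fn frame_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// Bytes still available for new frames.
    fn remaining(&self) -> usize {
        self.buf.len() - self.used
    }
}
```

The None case in alloc_frame is where the sizing problem described above shows up: without a hard cap on instructions per transaction, the capacity has to be chosen generously or the caller needs a fallback path.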

Jemalloc implements two levels of caching: a small lock-free, per-thread cache and larger arenas shared among threads. It turns out one way to avoid this particular issue is to make sure that the allocation fits in the per-thread cache (the default limit is 32k; here I bumped it to 256k):

$ MALLOC_CONF=tcache_max:262144 hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     131.8 ms ±   9.1 ms    [User: 1346.1 ms, System: 113.6 ms]
  Range (min … max):   119.6 ms … 149.5 ms    22 runs

Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     135.6 ms ±   8.5 ms    [User: 1404.5 ms, System: 127.1 ms]
  Range (min … max):   124.7 ms … 154.3 ms    21 runs

Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):     100.5 ms ±   8.6 ms    [User: 104.2 ms, System: 308.4 ms]
  Range (min … max):    88.0 ms … 132.5 ms    30 runs

Summary
  'target/release/examples/mem calloc_slab' ran
    1.31 ± 0.14 times faster than 'target/release/examples/mem malloc_memset'
    1.35 ± 0.14 times faster than 'target/release/examples/mem calloc'
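If the same setting needs to be applied from a launcher rather than typed on the command line, something like the following sketch could work. The helper names are hypothetical; the only facts taken from the runs above are the MALLOC_CONF variable, the tcache_max key and the 262144 value (256 * 1024). One assumption to verify: jemalloc builds with prefixed symbols may read a differently named environment variable.

```rust
use std::process::Command;

/// Builds the jemalloc option string for a per-thread cache limit given in KiB.
fn tcache_max_conf(kib: usize) -> String {
    format!("tcache_max:{}", kib * 1024)
}

/// Prepares (but does not run) the benchmark binary with MALLOC_CONF set.
/// `binary` and `mode` are passed through unchanged.
fn bench_command(binary: &str, mode: &str, tcache_kib: usize) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg(mode).env("MALLOC_CONF", tcache_max_conf(tcache_kib));
    cmd
}
```

For example, bench_command("target/release/examples/mem", "calloc", 256) sets up the equivalent of the calloc run above; calling .status() on the result would execute it.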

Proposed Solution

Has anyone looked into tuning jemalloc for the validator? This issue aside, I see that there's quite a bit of memory churn, so I'm tempted to fix this issue (and possibly more) by running the jemalloc profiler and making sure that more allocations get cached.

alessandrod commented 2 years ago

Btw for the lols: if you look at the stack trace, there's a _rjem_je_ehooks_default_zero_impl callback. Great! I thought I'd implement my own callback and make it not purge so often. Then I found this: https://github.com/jemalloc/jemalloc/blob/deb8e62a837b6dd303128a544501a7dc9677e47a/include/jemalloc/internal/ehooks.h#L367

ryoqun commented 2 years ago

hehe, nice finding.

i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

alessandrod commented 2 years ago

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

I thought about that and it'd be fairly easy to implement. Max frame size is fixed and CPIs are nested in the host stack too so we don't even need alloca. But it would merge the SBF stack with the host stack, which from a security perspective isn't worth the tradeoff I think.
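For illustration only, the host-stack variant could look roughly like this: each nested call keeps a fixed-size zeroed frame as a local array, and the worker thread is started with a larger stack. FRAME_BYTES, MAX_DEPTH and both function names are made up, and this is exactly the design that would mix program frames with the host stack, so it is a sketch of the idea being discussed, not a recommendation.

```rust
use std::thread;

/// Hypothetical fixed frame size and nesting limit, chosen for illustration.
const FRAME_BYTES: usize = 4096;
const MAX_DEPTH: usize = 5;

/// Models nested CPI calls: every level gets a zeroed frame on the host
/// stack, which is released automatically when the call returns.
/// Returns the number of frames that were live at the deepest point.
fn invoke(depth: usize) -> usize {
    let mut frame = [0u8; FRAME_BYTES];
    frame[0] = depth as u8;
    // Keep the compiler from optimizing the frame away.
    std::hint::black_box(&mut frame);
    if depth + 1 < MAX_DEPTH {
        1 + invoke(depth + 1)
    } else {
        1
    }
}

/// Runs the nested calls on a thread with an explicitly sized stack,
/// the std equivalent of raising the pthread stack size.
fn run_on_big_stack(stack_bytes: usize) -> usize {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(|| invoke(0))
        .expect("failed to spawn worker")
        .join()
        .expect("worker panicked")
}
```

No allocator call happens per frame here, so there is nothing for jemalloc to purge; the cost moves to stack sizing, which must cover MAX_DEPTH * FRAME_BYTES plus normal call overhead.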

ryoqun commented 4 months ago

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

after almost 2 years, i finally got my hands on this: https://github.com/anza-xyz/agave/pull/1364