Open alessandrod opened 2 years ago
Btw for the lols: if you look at the stack trace, there's a `_rjem_je_ehooks_default_zero_impl` callback. Great! I thought I'd implement my own callback and make it not purge so often. Then I found this: https://github.com/jemalloc/jemalloc/blob/deb8e62a837b6dd303128a544501a7dc9677e47a/include/jemalloc/internal/ehooks.h#L367
hehe, nice finding.
i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.
I thought about that and it'd be fairly easy to implement. Max frame size is fixed and CPIs are nested in the host stack too so we don't even need alloca. But it would merge the SBF stack with the host stack, which from a security perspective isn't worth the tradeoff I think.
after almost 2 years, i finally got my hands on this: https://github.com/anza-xyz/agave/pull/1364
Problem
While profiling a branch including all the patches needed to bring direct account mapping with abiv1, I noticed a very large amount of TLB flushes and page faults caused by the program runtime. Initially I feared that direct mapping changes were somehow causing the issue, but I've now observed that the problem can happen in master as well. Direct mapping does seem to make it worse, most likely by making the program runtime threads a lot faster (the irony!).
The problem is the following:
It looks like jemalloc always force-purges zeroed extents immediately, instead of implementing two-phase release like it does for non-zeroed allocations. Two-phase cleanup reduces the overhead of allocating/deallocating memory, at the expense of retaining a bit more memory during the decay period. Furthermore, jemalloc purges zeroed extents using `madvise(MADV_DONTNEED)`, which requires a TLB flush and, with our allocation sizes, a full TLB flush (the theory being that doing a full flush is faster than flushing the individual page entries).

Since we run the program runtime inside rayon, we have a bunch of threads constantly flushing TLBs, and therefore get into a by-the-book TLB shootdown (https://web.njit.edu/~dingxn/papers/ispa20.pdf).
To confirm that the shootdown is caused by the interaction between the rayon thread pool and jemalloc (the default glibc allocator doesn't exhibit the problem), I've written a minimal test case which mimics the `CallFrame` allocation we do in the program runtime: https://gist.github.com/alessandrod/a80788429873a4b9caa6aa53a82e0b2b

Here are perf numbers on a 64 vCPU gcloud VM:
You can see that `calloc` is awfully slower than `malloc_memset`, even though the latter causes nearly twice as many page faults as it pages in the whole allocation to zero it.

`calloc_slab` works around the problem by pre-allocating large zero extents and then purging in one go, therefore doing only one TLB flush when the whole slab is deallocated. This confirms that the problem is caused by releasing many small calloc allocations. I've prototyped this for the program runtime: one slab per transaction execution. Unfortunately, since we don't have a hard max number of instructions that can be executed per transaction, the slab needs to be quite large, and while it improves perf, it also increases peak virtual memory usage significantly (although actual paged-in memory stays lower than with `malloc_memset`).

Jemalloc implements two levels of caching: a small lock-free, per-thread cache and then larger arenas shared among threads. Turns out one way to avoid this particular issue is to make sure that the allocation fits in the per-thread cache (default is 32k, here I bumped it to 256k):
Proposed Solution
Has anyone looked into tuning jemalloc for the validator? This issue aside I see that there's quite a bit of memory churn, so I'm tempted to fix this issue (and possibly more), by running the jemalloc profiler and making sure that more allocations get cached.
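For anyone who wants to poke at this without opening the gist, here is a rough sketch of the three strategies named in the Problem section (`calloc`, `malloc_memset`, `calloc_slab`). It is not the code from the gist. It uses only std and plain threads instead of rayon, `FRAME_BYTES` is a made-up stand-in for a `CallFrame`-sized allocation, and with the system allocator it will not reproduce the shootdown by itself; jemalloc would have to be installed as the global allocator for that. It also assumes that `vec![0u8; n]` asks the allocator for zeroed memory, which is how current std behaves.

```rust
use std::thread;
use std::time::Instant;

// Hypothetical frame size; the real CallFrame allocation size is not stated here.
const FRAME_BYTES: usize = 64 * 1024;

/// Touch a frame so the allocation is actually used. Returns 1 for a zeroed frame.
fn touch(frame: &mut [u8]) -> u64 {
    frame[0] = 1;
    u64::from(frame[0]) + u64::from(frame[frame.len() - 1])
}

/// `calloc` strategy: one zeroed allocation per frame, freed right away.
fn calloc_each(frames: usize) -> u64 {
    let mut sum = 0;
    for _ in 0..frames {
        // Assumption: std requests zeroed memory from the allocator here.
        let mut frame = vec![0u8; FRAME_BYTES];
        sum += touch(&mut frame);
    }
    sum
}

/// `malloc_memset` strategy: uninitialized allocation, then explicit zeroing.
/// Note that an optimizing compiler may merge these two steps.
fn malloc_memset_each(frames: usize) -> u64 {
    let mut sum = 0;
    for _ in 0..frames {
        let mut frame: Vec<u8> = Vec::with_capacity(FRAME_BYTES);
        frame.resize(FRAME_BYTES, 0);
        sum += touch(&mut frame);
    }
    sum
}

/// `calloc_slab` strategy: one large zeroed allocation carved into frames,
/// released in one go when the slab is dropped.
fn calloc_slab(frames: usize) -> u64 {
    let mut slab = vec![0u8; frames * FRAME_BYTES];
    slab.chunks_exact_mut(FRAME_BYTES).map(touch).sum()
}

/// Run the same workload on several threads, mimicking a thread pool.
fn run_parallel(threads: usize, frames: usize, work: fn(usize) -> u64) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(move || work(frames)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .sum()
}

fn main() {
    let strategies: [(&str, fn(usize) -> u64); 3] = [
        ("calloc", calloc_each),
        ("malloc_memset", malloc_memset_each),
        ("calloc_slab", calloc_slab),
    ];
    for (name, work) in strategies {
        let start = Instant::now();
        let touched = run_parallel(8, 1_000, work);
        println!("{name}: {touched} frames in {:?}", start.elapsed());
    }
}
```

The slab variant trades one big allocation and one release for many small ones, which is why its size has to be picked up front.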