Open armintoepfer opened 4 years ago
@armintoepfer: ah, great to hear; mimalloc is used on big workloads internally and this is usually my main target for perf improvements nowadays. Performance for allocators is highly workload dependent, so I guess mimalloc cannot always be best -- nevertheless, in all my benchmarking mimalloc is always faster than glibc, so I am surprised to see it doing worse than glibc on the small workload.
Is there a way for me to run your workload locally? best if I can compile it as well so I can see the kind of allocations better.
Tuning wise, you could try:
- the latest `dev` version
- `MIMALLOC_PAGE_RESET=0`
- `MIMALLOC_RESERVE_HUGE_OS_PAGES=N` (see https://github.com/microsoft/mimalloc#environment-options for the environment options and how to enable this on Linux)
- `MIMALLOC_USE_NUMA_NODES=1` -- I had reports about some trouble with NUMA nodes on the AMD systems, so maybe this helps.
- `MIMALLOC_EAGER_COMMIT_DELAY=0`
- the `dev-slice` branch, to see how it does; it is currently sometimes faster but not always, and tends to use less memory.

One of these may help. Let me know how it goes. Even if it improves, I am still interested in your workload, to figure out why mimalloc is not performing as well as it should on it. Best, Daan
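As a sketch of trying these knobs (the option names are documented mimalloc environment variables; the huge-page count and the binary name are placeholder assumptions):

```shell
# Sketch: set the mimalloc tuning options above for a single run.
# MIMALLOC_RESERVE_HUGE_OS_PAGES=4 is an arbitrary example value,
# and ./your-workload is a placeholder for the actual binary.
export MIMALLOC_PAGE_RESET=0
export MIMALLOC_RESERVE_HUGE_OS_PAGES=4
export MIMALLOC_USE_NUMA_NODES=1
export MIMALLOC_EAGER_COMMIT_DELAY=0
export MIMALLOC_VERBOSE=1   # have mimalloc print the settings it picked up
# ./your-workload ...
```

Changing one option per run makes it easier to see which knob actually moves the needle for a given workload.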
Thank you for the hints, Daan. I will try each of those options.
Regarding huge pages, is that per thread? I'm running 256 threads. Is that going to be a problem?
> Is there a way for me to run your workload locally? best if I can compile it as well so I can see the kind of allocations better.
This is not possible, as this is closed source.
Armin
Thanks. The huge OS pages are allocated per process and stay pinned in memory (i.e. there is no swap for those), and each uses a single TLB entry. So the number of threads does not matter for this. If you run on dedicated hardware, it is usually good for performance to match your usual working set. The drawbacks of huge pages are mostly that 1) they can take a while to set up (tens of seconds on memory-fragmented systems) and 2) other processes cannot use/share that memory (which is why one usually needs to set permissions).
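For example (a sketch: inspecting `/proc/meminfo` is standard on Linux; the reserve count and binary name below are placeholder assumptions), you can check what the kernel reports before reserving:

```shell
# Sketch: inspect huge page state on Linux before reserving pages via mimalloc.
# The HugePages_* and Hugepagesize lines show what the kernel currently provides.
grep -i huge /proc/meminfo

# Then reserve N 1GiB huge OS pages at process start (example: N=4); expect a
# possible startup delay on memory-fragmented systems, as noted above.
# MIMALLOC_RESERVE_HUGE_OS_PAGES=4 ./your-workload   # placeholder binary name
```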
Hi!
First of all, legendary work! I've been using mimalloc for https://github.com/PacificBiosciences/ccs to saturate 256 threads on those new AMD 2x7742 or 2x7H12 servers. I'm blown away by how much faster it is compared to glibc 2.32, especially when running with large OS pages. There is one edge case: mimalloc is still slower than the arena allocator of the latest glibc if I use only a few threads, like 16, on a large machine.
I'm aware that's ricing, but in production our run times are ~30 hours, so every percent of wall time counts.
I build mimalloc from source and then link it statically.
My question: is there any way to tune mimalloc to be as fast as the latest glibc allocator for a lower number of threads?
Thank you! Armin