
mimalloc slower than glibc 2.30 #293

Open armintoepfer opened 4 years ago

armintoepfer commented 4 years ago

Hi!

First of all, legendary work! I've been using mimalloc for https://github.com/PacificBiosciences/ccs to saturate 256 threads on those new AMD 2x7742 or 2x7H12 servers. I'm blown away by how much faster it is compared to glibc 2.30, especially when running with large OS pages. There is one edge case where mimalloc is still slower than the arena allocator of the latest glibc: when I use only a few threads, like 16, on a large machine.

On a small dataset, using 16 threads:

| Allocator | Threads | Wall Time | CPU Time |
| --- | --- | --- | --- |
| mimalloc + large_os_pages | 16 | 4m 35s | 1h 12m |
| mimalloc | 16 | 4m 36s | 1h 13m |
| glibc 2.30 | 16 | 4m 15s | 1h 7m |

On a dataset with 10x more data, using 256 threads:

| Allocator | Threads | Wall Time | CPU Time |
| --- | --- | --- | --- |
| mimalloc + large_os_pages | 256 | 4m 43s | 19h 04m |
| mimalloc | 256 | 5m 03s | 20h 25m |
| glibc 2.30 | 256 | 5m 18s | 14h 58m |

I'm aware that's ricing, but in production our run times are ~30 hours, so every percent of wall time counts.

I build mimalloc from source

```sh
# out-of-source build, run from a build directory inside the mimalloc checkout
cmake -GNinja -DCMAKE_INSTALL_PREFIX:PATH=${FOO}/software ..
ninja -v
ninja install
```

and then link it statically:

```sh
LDFLAGS="${FOO}/software/lib/mimalloc-1.6/libmimalloc.a"
```

My question: is there any way to tune mimalloc to be as fast as the latest glibc allocator at lower thread counts?

Thank you! Armin

daanx commented 4 years ago

@armintoepfer: ah, great to hear; mimalloc is used on big workloads internally, and those are usually my main target for perf improvements nowadays. Now, allocator performance is highly workload dependent, so I guess mimalloc cannot always be best -- nevertheless, in all my benchmarking mimalloc is always faster than glibc, so I am surprised to see it doing worse on the small workload.

Is there a way for me to run your workload locally? Best would be if I can compile it as well, so I can see the kinds of allocations better.

Tuning-wise, you could try the following (a combined invocation is sketched after the list):

  1. Get the latest dev version
  2. Disable page reset: MIMALLOC_PAGE_RESET=0
  3. Use huge OS pages instead of "large" ones (see here for MIMALLOC_RESERVE_HUGE_OS_PAGES=N: https://github.com/microsoft/mimalloc#environment-options and how to enable this on Linux)
  4. Depending on the NUMA nodes, you may also want to try MIMALLOC_USE_NUMA_NODES=1 -- I had reports about some trouble with NUMA nodes on AMD systems, so maybe this helps.
  5. Disable eager commit delay: MIMALLOC_EAGER_COMMIT_DELAY=0
  6. (Experimental: check out the dev-slice branch and see how it does; currently sometimes faster but not always, tends to use less memory)
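
A minimal sketch of combining those environment variables on a single run (the `./ccs` invocation and the reservation size of 4 pages are placeholders for your setup):

```sh
# Illustrative run combining the suggestions above; size
# MIMALLOC_RESERVE_HUGE_OS_PAGES (1GiB pages) to roughly match the working set.
MIMALLOC_PAGE_RESET=0 \
MIMALLOC_RESERVE_HUGE_OS_PAGES=4 \
MIMALLOC_USE_NUMA_NODES=1 \
MIMALLOC_EAGER_COMMIT_DELAY=0 \
./ccs ...
```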

One of these may help. Let me know how it goes. Even if it improves, I am still interested in your workload, to figure out why mimalloc is not performing as well as it should on it. Best, Daan

armintoepfer commented 4 years ago

Thank you for the hints, Daan. I will try each of those options.

Regarding huge pages, is that per thread? I'm running 256 threads. Is that going to be a problem?

> Is there a way for me to run your workload locally? Best would be if I can compile it as well, so I can see the kinds of allocations better.

Unfortunately that's not possible, as the code is closed source.

Armin

daanx commented 4 years ago

Thanks. The huge OS pages are allocated per-process and stay pinned in memory (i.e. there is no swap for those), and each one uses only a single TLB entry. So the number of threads does not matter for this. If you run on dedicated hardware, it is usually good for performance to reserve enough to match your usual working set. The drawbacks of huge pages are mostly that 1) they can take a while to set up (tens of seconds on memory-fragmented systems) and 2) other processes cannot use/share that memory (which is why one usually needs to set permissions).
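
To see whether the reservation succeeded (and how long it took), one option is to enable mimalloc's documented diagnostic switches; a sketch, again with `./ccs ...` as a placeholder:

```sh
# MIMALLOC_VERBOSE=1 prints option settings and startup messages, including
# the huge-page reservation; MIMALLOC_SHOW_STATS=1 prints allocation
# statistics when the process exits.
MIMALLOC_VERBOSE=1 MIMALLOC_SHOW_STATS=1 \
MIMALLOC_RESERVE_HUGE_OS_PAGES=4 ./ccs ...
```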