microsoft / snmalloc

Message passing based allocator
MIT License

Poor performance report #251

Open mjp41 opened 3 years ago

mjp41 commented 3 years ago

This issue is to track the poor performance reported by @armintoepfer on Twitter when comparing to mimalloc.

In initial testing, it's much slower than mimalloc. Do you have env variables, similar to mimalloc's, that show debug summaries to verify that your library was properly loaded? Maybe I screwed something up ;)

https://twitter.com/XLR/status/1310687339887423489

@armintoepfer, firstly thanks for trying snmalloc. My normal approach to checking if it has been correctly loaded is to run a profile, and see if any snmalloc symbols appear in the trace. If you could run perf on a benchmarking run and share any profiling data about the snmalloc and libc symbols, then we might be able to spot what is happening.

How are you loading it? LD_PRELOAD or are you linking the static library? The LD_PRELOAD approach is much better tested. The static library has only been used by the LLVM Windows patch so far.

Is the performance you are seeing similar to the system allocator? This would suggest we haven't intercepted the calls correctly.

armintoepfer commented 3 years ago

Hey!

For our CCS binary we are ricing like hell, as every percentage point of performance is actual money saved and reduced time to result for future clinical samples. Who wouldn't like to get the results of a rare-disease screen as fast as possible? Admittedly that is still some way off, but we are thinking big here :)

Here's a little bit about my toolchain:

The easiest way for me to check whether mimalloc is linked properly is to use one of its verbose env variables :)

I can deactivate PGO and then run perf.

davidchisnall commented 3 years ago

If you're statically linking, then before you strip the binary, running nm and checking for snmalloc symbols will tell you whether you've pulled it in (which doesn't guarantee that it's the malloc that's actually being used, but will at least ensure that it's there).

Are you including snmalloc in the PGO / LTO bundle? Our fast path is around 15 instructions for malloc (one more when compiled with GCC than clang) and so may end up inlined if you're allocating on a hot path.

armintoepfer commented 3 years ago

Are you including snmalloc in the PGO / LTO bundle?

yes

I'll test nm and let you know.

armintoepfer commented 3 years ago
Statically linked w/ PGO & LTO & glibc 2.32 on 2x 7H12:

| Run | Wall time | CPU time | Peak RSS |
| --- | --- | --- | --- |
| MIMALLOC Tuned* | 4m 3s | 15h 54m | 25.9 GB |
| MIMALLOC | 4m 32s | 16h 35m | 26.6 GB |
| SNMALLOC 16MB | 5m 4s | 14h 37m | 30.6 GB |
| SNMALLOC 1MB | 5m 8s | 14h 42m | 30.7 GB |
| GLIBC 2.32 | 5m 19s | 15h 44m | 30.9 GB |

Dynamically linked w/ glibc 2.17 on 2x 7742:

| Run | Wall time | CPU time | Peak RSS |
| --- | --- | --- | --- |
| MIMALLOC Tuned* | 5m 12s | 21h 25m | 30.2 GB |
| SNMALLOC 16MB | 5m 21s | 22h 4m | 29.9 GB |
| SNMALLOC 1MB | 5m 21s | 22h 4m | 30.0 GB |
| GLIBC 2.32 | 7m 38s | 16h 57m | 23.9 GB |

*MIMALLOC_PAGE_RESET=0 MIMALLOC_LARGE_OS_PAGES=1

It's obviously using snmalloc, as the default malloc is really slow with 256 threads running at full throttle.

mjp41 commented 3 years ago

So given that we are spending less CPU time but more wall-clock time, it suggests to me that we are spending too much time blocked in the kernel, holding a lock or something similar. There is one bit of code I can think of that could be causing that: https://github.com/microsoft/snmalloc/blob/923705e514e850ee0070cef470fca9d9c47ca3da/src/pal/pal_linux.h#L44-L55
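
For anyone following along, the effect of that code is roughly the following (an illustrative sketch, not the actual snmalloc source; the threshold is made up):

```cpp
// Illustrative sketch only: zero a large, page-aligned range by handing the
// pages back to the kernel with MADV_DONTNEED, so they come back as fresh
// zero pages on the next touch, instead of memset-ing them. The catch is
// that dropping the mapping can trigger TLB-shootdown IPIs on every core
// that has it cached, which gets expensive on a 256-thread box.
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

void zero_range_sketch(void* p, size_t size)
{
  // Hypothetical threshold: below it, a memset is cheaper than a syscall
  // plus the IPIs it may cause.
  constexpr size_t threshold = 16 * 4096;
  if (size < threshold || madvise(p, size, MADV_DONTNEED) != 0)
    memset(p, 0, size);
}
```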

I wonder if, given the size of your machine, this IPI is costing too much. If you set

-DSNMALLOC_QEMU_WORKAROUND=On

then it will disable that call (it is a badly named flag, based on a previous issue we had).

I guess the other possibility is that mimalloc has custom code for NUMA, which we don't have. That could also be a source of slowdown, but fixing that would be much harder.

P.S. Thank you for collecting and submitting the benchmark results.

davidchisnall commented 3 years ago

Tuning that has been on my to-do list for a while. The 16-page minimum looks quite plausible, but on x86 in particular it may require some IPIs: the cost of MADV_DONTNEED scales as some polynomial in the number of cores, whereas the cost of memset is linear in the size of the allocation, so the intersection of the two is going to be complicated. I'd like to write a little microbenchmark that tries to find the inflection point for a given processor, which we could (optionally?) run from CMake and use on a bunch of different machines to pick a number based on the core count when we don't have a magic number for the current machine.
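
A first cut at that microbenchmark could be as simple as the sketch below (single-threaded, with made-up sizes; a real version would need other threads touching the mapping first so that the TLB-shootdown cost actually shows up):

```cpp
// Rough sketch: compare madvise(MADV_DONTNEED) + faulting the pages back in
// against a plain memset of the same range, for growing sizes.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main()
{
  constexpr size_t max_size = 64 * 1024 * 1024;
  void* p = mmap(nullptr, max_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return 1;

  using clk = std::chrono::steady_clock;
  for (size_t size = 4096; size <= max_size; size *= 2)
  {
    auto t0 = clk::now();
    madvise(p, size, MADV_DONTNEED);
    memset(p, 1, size); // fault the dropped pages back in
    auto t1 = clk::now();
    memset(p, 2, size); // plain memset of already-mapped memory
    auto t2 = clk::now();

    auto ns = [](auto d) {
      return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    printf("%8zu KiB: madvise+fault %10lld ns, memset %10lld ns\n",
           size / 1024, (long long)ns(t1 - t0), (long long)ns(t2 - t1));
  }
  munmap(p, max_size);
}
```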

davidchisnall commented 3 years ago

Is the CPU time just user, or user + system?

In the dynamically linked version, we're using a bit more than 3% more CPU than mimalloc and a bit less than 3% more wall-clock time. There's a much bigger difference in the PGO + LTO version. I wonder if we're seeing our fast path being inlined more than it should be and pushing more things out of the i-cache? It's surprising that we'd use so much less (8%) CPU time but a lot more (25%!) wall-clock time; as @mjp41 says, that implies that we're holding a kernel lock somewhere and just not running for a bit.

mjp41 commented 3 years ago

@armintoepfer if you have page-fault statistics available, those might also suggest what is going on. Do you have transparent huge pages enabled on the machine? I think the following should tell you:

cat /sys/kernel/mm/transparent_hugepage/enabled
armintoepfer commented 3 years ago
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
The following are all static PGO/LTO builds on 2x 7H12 systems.

| Metric | SN 16 QEMU=On | SN 16 | MI Tuned | MI def |
| --- | --- | --- | --- | --- |
| User time (seconds) | 53000.29 | 52938.56 | 57330.39 | 58802.21 |
| System time (seconds) | 339.14 | 357.15 | 63.45 | 990.52 |
| Percent of CPU this job got | 16850% | 16737% | 21999% | 20509% |
| Elapsed (wall clock) time (h:mm:ss or m:ss) | 5:16.53 | 5:18.42 | 4:20.88 | 4:51.53 |
| Average shared text size (kbytes) | 0 | 0 | 0 | 0 |
| Average unshared data size (kbytes) | 0 | 0 | 0 | 0 |
| Average stack size (kbytes) | 0 | 0 | 0 | 0 |
| Average total size (kbytes) | 0 | 0 | 0 | 0 |
| Maximum resident set size (kbytes) | 30652060 | 31385804 | 26211960 | 28317732 |
| Average resident set size (kbytes) | 0 | 0 | 0 | 0 |
| Major (requiring I/O) page faults | 1 | 1 | 1 | 1 |
| Minor (reclaiming a frame) page faults | 22091749 | 21991651 | 487583 | 20438894 |
| Voluntary context switches | 13798271 | 13884004 | 1468104 | 1773052 |
| Involuntary context switches | 868023 | 888361 | 484109 | 718255 |
| Swaps | 0 | 0 | 0 | 0 |
| File system inputs | 23938776 | 23938776 | 23938920 | 23938920 |
| File system outputs | 1842072 | 1842096 | 1842232 | 1842176 |
| Socket messages sent | 0 | 0 | 0 | 0 |
| Socket messages received | 0 | 0 | 0 | 0 |
| Signals delivered | 0 | 0 | 0 | 0 |
| Page size (bytes) | 4096 | 4096 | 4096 | 4096 |
| Exit status | 0 | 0 | 0 | 0 |
mjp41 commented 3 years ago

@armintoepfer thanks for this information. I have not seen this level of difference in page faults before. I will run some benchmarks with the MIMALLOC_LARGE_OS_PAGES=1 setting, to try to dig into this.

Reading the mimalloc code, they grab memory using MAP_HUGETLB at 1GiB granularity, fall back to 2MiB if that fails, and then fall back to transparent huge pages (THP) if that does not work, whereas snmalloc just assumes THP will do a good job. I will see what I can implement to improve this.
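
The fallback chain is roughly the pattern below (a sketch of the general idea, not the actual mimalloc or snmalloc code; flag availability depends on your kernel and glibc headers):

```cpp
// Sketch of the fallback chain described above: try 1GiB huge pages, then
// 2MiB huge pages, then plain pages with a hint that THP may back them.
#include <cstddef>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#  define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#  define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif
#ifndef MAP_HUGE_1GB
#  define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

void* reserve_sketch(size_t size)
{
  const int prot = PROT_READ | PROT_WRITE;
  void* p = mmap(nullptr, size, prot,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
  if (p == MAP_FAILED)
    p = mmap(nullptr, size, prot,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
  if (p == MAP_FAILED)
  {
    p = mmap(nullptr, size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
      madvise(p, size, MADV_HUGEPAGE); // let THP back it if the kernel agrees
  }
  return p == MAP_FAILED ? nullptr : p;
}
```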

@armintoepfer, thanks for taking the time to try snmalloc and for providing us with such useful feedback.

armintoepfer commented 3 years ago

Sure thing. But even without MIMALLOC_LARGE_OS_PAGES, mimalloc is much faster than snmalloc by default, so there is something beyond pages that is not working out in my use case. Ping me in case you have something to benchmark.

mjp41 commented 3 years ago

@armintoepfer that is a very good point. It would be interesting to know the page-fault numbers for the non-tuned version.

armintoepfer commented 3 years ago

Updated.

nwf commented 3 years ago

The things that leap out to me are (and perhaps these are obvious to other observers; if so, sorry for the spam):

mjp41 commented 3 years ago

@nwf yeah, I think mimalloc is probably using 1GiB huge pages in that configuration, so that is why there are very few page faults when it is tuned.

With regard to the context switches, that is a good question. I have found that on some microbenchmarks, if I put in a large thread count, then I can replicate the statistics that @armintoepfer has. But it is unclear to me if that is coincidence or actually replicating the issue.

@armintoepfer are you using the same number of OS threads as hardware threads in your application?

One concern I have: there are a few places where we spin on infrequently accessed data structures, and when we get to 256 threads I am concerned this might no longer be the correct choice. I have created a branch that monitors the time spent in these bits:

https://github.com/microsoft/snmalloc/tree/mini_stats

@armintoepfer if you would be able to run this and send me the output, that would be amazing. It should look something like:

MPMC Push:
  Times: 2
  Time spend waiting: 1338 cycles
  Max wait: 1192 cycles
MPMC Push:
  Times: 4
  Time spend waiting: 1176 cycles
  Max wait: 382 cycles
MPMC Pop:
  Times: 7
  Time spend waiting: 3748 cycles
  Max wait: 1128 cycles
MPMC Pop:
  Times: 2
  Time spend waiting: 556 cycles
  Max wait: 368 cycles
FlagLock:
  Times: 11
  Time spend waiting: 948 cycles
  Max wait: 248 cycles

But the numbers will be much bigger on your machine.
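
(For context, the branch gathers these numbers by wrapping the spin-wait paths with a cycle counter; the sketch below shows the general idea, not the actual code on the branch.)

```cpp
// Illustration of per-lock wait accounting in the spirit of the output
// above; not the snmalloc mini_stats implementation.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc

struct FlagLockWithStats
{
  std::atomic<bool> flag{false};
  uint64_t times = 0, total_wait = 0, max_wait = 0;

  void lock()
  {
    uint64_t start = __rdtsc();
    while (flag.exchange(true, std::memory_order_acquire))
      ; // spin until we win the flag
    uint64_t waited = __rdtsc() - start;
    // Stats are updated while holding the lock, so no extra synchronisation.
    times++;
    total_wait += waited;
    if (waited > max_wait)
      max_wait = waited;
  }

  void unlock()
  {
    flag.store(false, std::memory_order_release);
  }

  ~FlagLockWithStats()
  {
    printf("FlagLock:\n  Times: %llu\n  Time spent waiting: %llu cycles\n"
           "  Max wait: %llu cycles\n",
           (unsigned long long)times, (unsigned long long)total_wait,
           (unsigned long long)max_wait);
  }
};
```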

Second, @armintoepfer, for splitting the work up: do you split it uniformly across the threads at the start and then wait for them all to terminate, or do you have a more work-stealing-like approach?

armintoepfer commented 3 years ago

are you using the same number of OS threads as hardware threads in your application?

Correct. I'm using the same number of compute-heavy threads as I have (logical) cores available; logical because of hyper-threading. I have an additional 2x #cores threads for very lightweight output writing.

for splitting the work up, do you uniformly split it at the start across the threads, and then wait for them to all terminate, or do you have a more work stealing like approach?

It's a custom thread pool that I call FireAndForget. There is one producer, one orchestrator, and N workers. The orchestrator distributes work such that every worker is always busy. The Forget part is that there is no feedback from the workers back to the orchestrator: each worker writes output on its own and stores additional metadata locally, which gets merged right before joining the threads. This leaves no bottleneck except getting enough data to the workers.
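
Schematically it is something like this (a heavily simplified sketch, not our production code; all names are made up):

```cpp
// Simplified FireAndForget sketch: one orchestrator pushes jobs into
// per-worker queues; workers write their own output and never report back;
// any per-worker metadata would be merged at join time.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class FireAndForgetSketch
{
  struct Worker
  {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool stopping = false;
    std::thread t;
  };

  std::vector<std::unique_ptr<Worker>> workers;

public:
  explicit FireAndForgetSketch(size_t n)
  {
    for (size_t i = 0; i < n; i++)
    {
      auto w = std::make_unique<Worker>();
      Worker* wp = w.get();
      wp->t = std::thread([wp] {
        for (;;)
        {
          std::function<void()> job;
          {
            std::unique_lock<std::mutex> lk(wp->m);
            wp->cv.wait(lk, [wp] { return wp->stopping || !wp->jobs.empty(); });
            if (wp->jobs.empty())
              return; // stopping and drained
            job = std::move(wp->jobs.front());
            wp->jobs.pop();
          }
          job(); // worker writes its own output; nothing flows back
        }
      });
      workers.push_back(std::move(w));
    }
  }

  // Orchestrator side: hand a job to a specific worker and forget about it.
  void submit(size_t worker, std::function<void()> job)
  {
    Worker& w = *workers[worker];
    {
      std::lock_guard<std::mutex> lk(w.m);
      w.jobs.push(std::move(job));
    }
    w.cv.notify_one();
  }

  ~FireAndForgetSketch()
  {
    for (auto& w : workers)
    {
      {
        std::lock_guard<std::mutex> lk(w->m);
        w->stopping = true;
      }
      w->cv.notify_one();
      w->t.join(); // per-worker metadata would be merged here
    }
  }
};
```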

if you would be able to run this in at send me the output, that would be amazing.

I did, but I don't get any output from snmalloc when using static linkage with PGO/LTO. I rebuilt and linked dynamically via LD_PRELOAD and libsnmallocshim-16mib.so:

MPMC Push:
  Times: 2080
  Time spend waiting: 344006 cycles
  Max wait: 4056 cycles
MPMC Push:
  Times: 3211647
  Time spend waiting: 1227836194 cycles
  Max wait: 530608 cycles
MPMC Pop:
  Times: 3225056
  Time spend waiting: 2753547577 cycles
  Max wait: 68954808 cycles
MPMC Pop:
  Times: 2081
  Time spend waiting: 315796 cycles
  Max wait: 3172 cycles
FlagLock:
  Times: 20543
  Time spend waiting: 4008420 cycles
  Max wait: 748410 cycles
mjp41 commented 3 years ago

@armintoepfer thanks for running this. While it does suggest there is an opportunity for optimisation in one case, it is nowhere near enough to account for the difference.

I will try to work out why the printing does not work for the LTO version, as that was the one that showed the difference in a considerably more pronounced way.

mjp41 commented 3 years ago

@armintoepfer, in the LTO/PGO case, how are you compiling snmalloc? Are you including malloc.cc and new.cc directly in your build, or are you building the .a, linking that in, and adding some options to get it to use LTO?