mjp41 opened 3 years ago
Hey!
For our CCS binary we are ricing like hell, as every percentage of performance is actual money saved and reduces the time to result for future clinical samples. Who wouldn't like to get the results of a rare-disease screen as fast as possible? Sure, this is still a way off, but we are thinking big here :)
Here's a little bit about my toolchain:
- `-ffast-math`
- `-march=nehalem`
- `LDFLAGS="-static"`
The easiest way for me to check whether mimalloc is linked properly is to use one of the verbose env variables :)
I can deactivate PGO and then run perf.
If you're statically linking then, before you strip the binary, running `nm` and checking for snmalloc symbols will let you see if you've pulled it in (which doesn't guarantee that it's the malloc that's actually being used, but will at least ensure that it's there).
Are you including snmalloc in the PGO / LTO bundle? Our fast path is around 15 instructions for malloc (one more when compiled with GCC than clang) and so may end up inlined if you're allocating on a hot path.
> Are you including snmalloc in the PGO / LTO bundle?

Yes. I'll test `nm` and let you know.
Statically linked w/ PGO+LTO & GLIBC 2.32 on 2x 7H12:

| Run | Wall time | CPU Time | Peak RSS |
|---|---|---|---|
| MIMALLOC Tuned* | 4m 3s | 15h 54m | 25.9 GB |
| MIMALLOC | 4m 32s | 16h 35m | 26.6 GB |
| SNMALLOC 16MB | 5m 4s | 14h 37m | 30.6 GB |
| SNMALLOC 1MB | 5m 8s | 14h 42m | 30.7 GB |
| GLIBC 2.32 | 5m 19s | 15h 44m | 30.9 GB |
Dynamically linked w/ GLIBC 2.17 on 2x 7742:

| Run | Wall time | CPU Time | Peak RSS |
|---|---|---|---|
| MIMALLOC Tuned* | 5m 12s | 21h 25m | 30.2 GB |
| SNMALLOC 16MB | 5m 21s | 22h 4m | 29.9 GB |
| SNMALLOC 1MB | 5m 21s | 22h 4m | 30.0 GB |
| GLIBC 2.32 | 7m 38s | 16h 57m | 23.9 GB |
\* `MIMALLOC_PAGE_RESET=0 MIMALLOC_LARGE_OS_PAGES=1`
It's obviously using snmalloc, as the default malloc is really slow with 256 threads running at full throttle.
So given we are spending less CPU time, it suggests to me that we are spending too much time in the kernel holding a lock or something. There is one bit of code I can think of that could be causing that: https://github.com/microsoft/snmalloc/blob/923705e514e850ee0070cef470fca9d9c47ca3da/src/pal/pal_linux.h#L44-L55
I wonder if, given the size of your machine, this IPI is costing too much. If you set `-DSNMALLOC_QEMU_WORKAROUND=On`, then it will disable that call (a badly named flag, based on a previous issue we had).
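For reference, the flag can be passed when configuring the build; a hypothetical invocation (directory names are illustrative):

```shell
# Example only: pass the workaround flag at CMake configure time.
cmake -B build -DSNMALLOC_QEMU_WORKAROUND=On
cmake --build build
```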
I guess the other possibility is that mimalloc has custom code for NUMA, which we don't have. That could also be a source of slow down, but fixing that would be much harder.
P.S. Thank you for collecting and submitting the benchmark results.
Tuning that has been on my to-do list for a while. The 16-page minimum looks quite plausible, but on x86 in particular it may require some IPIs, and the cost of `MADV_DONTNEED` will scale as some polynomial in the number of cores, whereas the cost of `memset` is linear in the size of the allocation. The intersection of those is going to be complicated. I'd like to write a little microbenchmark that can try to figure out the inflection point for a given processor, that we can (optionally?) run from CMake, and that we can use on a bunch of different machines to pick a number based on the number of cores if we don't have a magic number for the current machine.
Is the CPU time just user, or user + system?
In the dynamically linked version, we're using a bit more than 3% more CPU than mimalloc and a bit less than 3% more wall-clock time. There's a much bigger difference in the PGO + LTO version. I wonder if we're seeing our fast path being inlined more than it should be and pushing more things out of i-cache? It's surprising that we'd use so much less (8%) CPU time but a lot more (25%!) wall-clock time; as @mjp41 says, that implies that we're holding a kernel lock somewhere and just not running for a bit.
@armintoepfer if you have page-fault statistics available, that might also suggest what is going on. Do you have transparent huge pages switched on for the machine? I think running `cat /sys/kernel/mm/transparent_hugepage/enabled` should tell you.
```
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
```
The following are all static PGO/LTO builds on 2x 7H12 systems.

| Metric | SN 16 QEMU=On | SN 16 | MI Tuned | MI def |
|---|---|---|---|---|
| User time (seconds) | 53000.29 | 52938.56 | 57330.39 | 58802.21 |
| System time (seconds) | 339.14 | 357.15 | 63.45 | 990.52 |
| Percent of CPU this job got | 16850% | 16737% | 21999% | 20509% |
| Elapsed (wall clock) time (h:mm:ss or m:ss) | 5:16.53 | 5:18.42 | 4:20.88 | 4:51.53 |
| Average shared text size (kbytes) | 0 | 0 | 0 | 0 |
| Average unshared data size (kbytes) | 0 | 0 | 0 | 0 |
| Average stack size (kbytes) | 0 | 0 | 0 | 0 |
| Average total size (kbytes) | 0 | 0 | 0 | 0 |
| Maximum resident set size (kbytes) | 30652060 | 31385804 | 26211960 | 28317732 |
| Average resident set size (kbytes) | 0 | 0 | 0 | 0 |
| Major (requiring I/O) page faults | 1 | 1 | 1 | 1 |
| Minor (reclaiming a frame) page faults | 22091749 | 21991651 | 487583 | 20438894 |
| Voluntary context switches | 13798271 | 13884004 | 1468104 | 1773052 |
| Involuntary context switches | 868023 | 888361 | 484109 | 718255 |
| Swaps | 0 | 0 | 0 | 0 |
| File system inputs | 23938776 | 23938776 | 23938920 | 23938920 |
| File system outputs | 1842072 | 1842096 | 1842232 | 1842176 |
| Socket messages sent | 0 | 0 | 0 | 0 |
| Socket messages received | 0 | 0 | 0 | 0 |
| Signals delivered | 0 | 0 | 0 | 0 |
| Page size (bytes) | 4096 | 4096 | 4096 | 4096 |
| Exit status | 0 | 0 | 0 | 0 |
@armintoepfer thanks for this information. I have not seen this level of difference in page faults before. I will run some benchmarks with the `MIMALLOC_LARGE_OS_PAGES=1` setting, to try to dig into this.
Reading the mimalloc code, they are mapping with `MAP_HUGETLB` at 1GiB granularity, falling back to 2MiB if that fails, and then falling back to transparent huge pages (THP) if that does not work. Whereas snmalloc just assumes THP will do a good job. I will see what I can implement to improve this.
@armintoepfer, thanks for taking the time to try snmalloc and providing such useful feedback.
Sure thing. But even without `MIMALLOC_LARGE_OS_PAGES`, `mimalloc` is much faster than `snmalloc` by default. So there is something beyond pages that is not working out in my use case. Ping me in case you have something to benchmark.
@armintoepfer that is a very good point. It would be interesting to know the page-fault numbers for the non-tuned version.
Updated.
The things that leap out to me (and perhaps these are obvious to other observers; if so, sorry for the spam):

- Why is `snmalloc` doing an order of magnitude more voluntary context switches than `mimalloc`? `snmalloc` doesn't make that many system calls; are we perhaps contending on the address-space lock in the kernel somehow?
- It looks like `mimalloc` can tune itself down to experiencing only 2% of the minor page faults experienced by the other configurations. What's going on there?
@nwf yeah, I think mimalloc is probably using 1GiB huge pages in that configuration, so that is why there are very few page faults when it is tuned.
With regard to the context switches, that is a good question. I have found that on some microbenchmarks, if I put in a large thread count, then I can replicate the statistics that @armintoepfer has. But it is unclear to me if that is coincidence or actually replicating the issue.
@armintoepfer are you using the same number of OS threads as hardware threads in your application?
One concern I have: there are a few places where we use spinning on infrequently accessed data structures, and when we get to 256 threads, I am not sure this is still the correct choice. I have created a branch that monitors the time spent in these bits.
https://github.com/microsoft/snmalloc/tree/mini_stats
@armintoepfer if you would be able to run this and send me the output, that would be amazing. It should be something like:
```
MPMC Push:
Times: 2
Time spend waiting: 1338 cycles
Max wait: 1192 cycles
MPMC Push:
Times: 4
Time spend waiting: 1176 cycles
Max wait: 382 cycles
MPMC Pop:
Times: 7
Time spend waiting: 3748 cycles
Max wait: 1128 cycles
MPMC Pop:
Times: 2
Time spend waiting: 556 cycles
Max wait: 368 cycles
FlagLock:
Times: 11
Time spend waiting: 948 cycles
Max wait: 248 cycles
```
But the numbers will be much bigger on your machine.
Second, @armintoepfer: for splitting the work up, do you uniformly split it at the start across the threads and then wait for them all to terminate, or do you have a more work-stealing-like approach?
> are you using the same number of OS threads as hardware threads in your application?
Correct. I'm using the same number of compute-heavy threads as I have (logical) cores available; logical because of hyper-threading. I have an additional 2x#cores threads for very lightweight output writing.
> for splitting the work up, do you uniformly split it at the start across the threads, and then wait for them to all terminate, or do you have a more work stealing like approach?
It's a custom threadpool that I call `FireAndForget`. There is one producer, one orchestrator, and N workers. The orchestrator distributes work such that every worker is always busy. The *Forget* part is that there is no feedback from the worker back to the orchestrator. The worker writes output on its own and stores additional metadata locally, which is merged right before joining threads. This leads to having no bottleneck, except getting enough data to the workers.
> if you would be able to run this and send me the output, that would be amazing.
I did, but I don't get any output from snmalloc using static linkage with PGO/LTO. Rebuilt and linked dynamically via `LD_PRELOAD` and `libsnmallocshim-16mib.so`:
```
MPMC Push:
Times: 2080
Time spend waiting: 344006 cycles
Max wait: 4056 cycles
MPMC Push:
Times: 3211647
Time spend waiting: 1227836194 cycles
Max wait: 530608 cycles
MPMC Pop:
Times: 3225056
Time spend waiting: 2753547577 cycles
Max wait: 68954808 cycles
MPMC Pop:
Times: 2081
Time spend waiting: 315796 cycles
Max wait: 3172 cycles
FlagLock:
Times: 20543
Time spend waiting: 4008420 cycles
Max wait: 748410 cycles
```
@armintoepfer Thanks for running this. While it does suggest there is an opportunity for optimisation in one case, it is nowhere near enough to account for the difference.
I will try to work out why the printing does not work for the LTO version, as that was the one that showed the difference in a considerably more pronounced way.
This issue is to track the poor performance reported by @armintoepfer on Twitter when comparing to mimalloc.
https://twitter.com/XLR/status/1310687339887423489
@armintoepfer, firstly thanks for trying snmalloc. My normal approach to checking whether it has been correctly loaded is to run a profile and see if any snmalloc symbols appear in the trace. If you could run `perf` on a benchmarking run and share any profiling data about the snmalloc and libc symbols, then we might be able to spot what is happening. How are you loading it? `LD_PRELOAD`, or are you linking the static library? The `LD_PRELOAD` approach is much better tested. The static library has only been used by the LLVM Windows patch so far.
Is the performance you are seeing similar to the system allocator? This would suggest we haven't intercepted the calls correctly.