Benchmarking and mimalloc

mjp41 commented 5 years ago

Based on mimalloc's design we have managed to dramatically improve the performance on snmalloc.

We should

Rerun our benchmarking infrastructure to compare
- snmalloc submitted to ISMM
- snmalloc current master
- mimalloc
Fully document the ideas, we adopted from the mimalloc paper, and how they impacted performance.

@daanx, @davidchisnall, @plietar, @sylvanc

mjp41 commented 5 years ago

Although the designs between mimalloc and snmalloc were very similar I think we had a different bias in why we had per slab free lists. In snmalloc it was primarily about fragmentation rather than performance. The mimalloc paper, demonstrated that this could lead to amazing performance.

Current things we did taken from mimalloc paper/design

Use a local free list per allocator per size class. #62 We move a per slab free list into the allocator state, such that is allocates from this first. Several bits were improved to make this fast path better,
- no longer using 16bit slab relative addressing
- putting the free list structure at the start of the allocator to improve code gen
- lots of refactoring to make the code sequence small with studious inline/noinline hints.
bump allocate a whole 4KiB worth of allocation into a free list, so we hit the free list path more regularly, mimalloc does the whole 64KiB in one go, and the strategy come directly from that. #62
Table lookup for sizeclass, although the code was very efficient for calculating the sizeclass, the codegen for a table lookup is much beter. #62
TLS init - mimalloc has a great trick of having a default allocator that will always go to the slow path, on that slow path, we can check if we are using the default allocator, and actually initialise a new one if we are. We adopted that in #64.
Uses a queue of slabs with free lists. In snmalloc, we initially used a stack of slabs, this was aimed at trying to reduce fragmentation. However, for some benchmarks/applications this means we would leave the fast path very regularly. mimalloc aims to pick slabs which it hasn't looked at recently to grab larger free lists, and hence stay on the fast path. This change did increase snmalloc's memory usage a little, but not considerably. #65

SchrodingerZhu commented 4 years ago

I am thinking about creating a benchmark website and update the results periodically (like some performance analysis website for the web frameworks). The mimalloc-bench toolsets may be a good point to start with, I can add some visualization tools for it and then create a docker that contains all allocators. Hopefully, I can also add the data visualization module direclt into a page generator so that I can generate the report easily.

daanx commented 4 years ago

Hi @SchrodingerZhu , if you are planning to do that, I do think the mimalloc-bench is a great starting point (and I can give you some javascript+latex that I use to generate nice graphs; perhaps a good initial project would be to take those scripts and extend it to automate the generation of the graphs?).

ps: I don't want to be discouraging, but I would also say that I myself would be quite hesitant about such live allocator benchmark -- it may well be detrimental in the end to allocator performance in general. Let me expand: The problem with allocators is that they need to deal with a wide range of situations and applications; but most benchmarks as they are now are just testing particular aspects and the results must be interpreted with much care; you can't just say one allocator is better because it is faster on some benchmarks. For example, for mimalloc the reduced latency (worst-case allocation times) is very important but none of the benchmarks test that (now). Similarly, hoard has many hard theoretical bounds on edge cases which makes it sometimes a bit slower in general case but it is great if you happen to be in that edge case, or mesh does heap compacting for long running processes which is not reflected, snmalloc cares much about moving objects between threads etc. There is a real risk that putting out an "easy" set of benchmarks makes people over-optimize for the specific mini benchmarks to the detriment of "real" applications whose behavior is generally more difficult to benchmark. Again, for mimalloc for example some companies use it with huge-page memory for large services with 100's of gigs of memory and I saw some other allocators stumble on that, but it is very hard to benchmark these kind of situations more generally; just even running such services is very difficult.

(Now, of course, I am already guilty myself by putting mimalloc-bench out there, but I tend to see it more as a handy toolkit and starting point for allocator evaluation. That's why I try to put always a lot of discussion and warnings about interpreting the results around the graphs.)

Anyway, just my 2c.

mjp41 commented 4 years ago

@schrodingerZhu I think it would be great to have something like this. But I also share Daan's (@daanx) worry.

Personally, I use the micro-benchmarks for finding problems rather than justifying superiority. I have been using mimalloc-bench for the last two weeks in preparation for a release of snmalloc. I found four places where perf was significantly different to mimalloc. When I studied these I found issues in the code or benchmark. Leading to several of the recent PRs.

xmalloc-testN - Showed poor performance in the message queue handling due to small batch size (#158 )
mstressN - Showed leak of allocators during TLS teardown (#161)
Larson - Benchmark does not match new[] with delete[], leading to incorrect size of returned object (#160 defensive check)
rptest - showed lots of page faults from using MADV_DONTNEED for zeroing (#159)

I would never want to do a parameter sweep to micro-optimise against this benchmark suite, but if you can justify the change independent of the benchmark, then I think it is great to let them find issues. You should also check why you are faster, not just the cases where you are slower.

Perhaps, to help address Daan's concerns places where allocator are targeted at should have specific benchmarks built for them. So perhaps once we have a live site doing this. Working out how to demonstrate latency in a new benchmark would be really interesting. My change #158 that improves xmalloc-testN may harm latency, but perhaps not enough to notice.

Also, at least allocators can be switched in real applications, so that can then address some of the larger scale benchmarking.

I think something like Tech Empower benchmarks has been hugely influential in making web servers faster.

SchrodingerZhu commented 4 years ago

work in progress at: https://github.com/SchrodingerZhu/bench_suite Haven't found a good example for Agda (though checking IO takes several seconds).

Though a preference of general performance can be useful, I agree that it is unfair for some allocators. (for instance, hardened_malloc performs slower than others on my side; but performance is not its goal; In rust environment, there is also something similar: wee_alloc is developed to reduce the wasm target size, but the performance issue is obvious. After all, having a benchmark has some benefits, it can at least help users to check the performance and provides a reference for them to make their choices).

SchrodingerZhu commented 4 years ago

current output can be found at: https://github.com/SchrodingerZhu/bench_suite/blob/master/output/index.md. Notice that these results are generated on my laptop and hence can be inaccurate.

@mjp41 It seems that snmalloc is having some higher memory peak in some cases.

mjp41 commented 4 years ago

Does the machine you run on have transparent hugh pages?

You can check with

cat /sys/kernel/mm/transparent_hugepage/enabled

If it is set to always, I have found snmalloc can have a very high memory usage as the way we request memory makes the system decide to use hugh pages, and then we may not use much of them.

You can disable it with:

echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Not sure what the correct setting is to have.

SchrodingerZhu commented 4 years ago

rust related simple benchmarks are added. snmalloc and mimalloc performs quite good.

microsoft / snmalloc

Benchmarking and mimalloc #67