microsoft / snmalloc

Message passing based allocator
MIT License
1.55k stars 108 forks source link

Benchmarking and mimalloc #67

Open mjp41 opened 5 years ago

mjp41 commented 5 years ago

Based on mimalloc's design we have managed to dramatically improve the performance on snmalloc.

We should

@daanx, @davidchisnall, @plietar, @sylvanc

mjp41 commented 5 years ago

Although the designs between mimalloc and snmalloc were very similar I think we had a different bias in why we had per slab free lists. In snmalloc it was primarily about fragmentation rather than performance. The mimalloc paper, demonstrated that this could lead to amazing performance.

Current things we did taken from mimalloc paper/design

SchrodingerZhu commented 4 years ago

I am thinking about creating a benchmark website and update the results periodically (like some performance analysis website for the web frameworks). The mimalloc-bench toolsets may be a good point to start with, I can add some visualization tools for it and then create a docker that contains all allocators. Hopefully, I can also add the data visualization module direclt into a page generator so that I can generate the report easily.

daanx commented 4 years ago

Hi @SchrodingerZhu , if you are planning to do that, I do think the mimalloc-bench is a great starting point (and I can give you some javascript+latex that I use to generate nice graphs; perhaps a good initial project would be to take those scripts and extend it to automate the generation of the graphs?).

ps: I don't want to be discouraging, but I would also say that I myself would be quite hesitant about such live allocator benchmark -- it may well be detrimental in the end to allocator performance in general. Let me expand: The problem with allocators is that they need to deal with a wide range of situations and applications; but most benchmarks as they are now are just testing particular aspects and the results must be interpreted with much care; you can't just say one allocator is better because it is faster on some benchmarks. For example, for mimalloc the reduced latency (worst-case allocation times) is very important but none of the benchmarks test that (now). Similarly, hoard has many hard theoretical bounds on edge cases which makes it sometimes a bit slower in general case but it is great if you happen to be in that edge case, or mesh does heap compacting for long running processes which is not reflected, snmalloc cares much about moving objects between threads etc. There is a real risk that putting out an "easy" set of benchmarks makes people over-optimize for the specific mini benchmarks to the detriment of "real" applications whose behavior is generally more difficult to benchmark. Again, for mimalloc for example some companies use it with huge-page memory for large services with 100's of gigs of memory and I saw some other allocators stumble on that, but it is very hard to benchmark these kind of situations more generally; just even running such services is very difficult.

(Now, of course, I am already guilty myself by putting mimalloc-bench out there, but I tend to see it more as a handy toolkit and starting point for allocator evaluation. That's why I try to put always a lot of discussion and warnings about interpreting the results around the graphs.)

Anyway, just my 2c.

mjp41 commented 4 years ago

@schrodingerZhu I think it would be great to have something like this. But I also share Daan's (@daanx) worry.

Personally, I use the micro-benchmarks for finding problems rather than justifying superiority. I have been using mimalloc-bench for the last two weeks in preparation for a release of snmalloc. I found four places where perf was significantly different to mimalloc. When I studied these I found issues in the code or benchmark. Leading to several of the recent PRs.

I would never want to do a parameter sweep to micro-optimise against this benchmark suite, but if you can justify the change independent of the benchmark, then I think it is great to let them find issues. You should also check why you are faster, not just the cases where you are slower.

Perhaps, to help address Daan's concerns places where allocator are targeted at should have specific benchmarks built for them. So perhaps once we have a live site doing this. Working out how to demonstrate latency in a new benchmark would be really interesting. My change #158 that improves xmalloc-testN may harm latency, but perhaps not enough to notice.

Also, at least allocators can be switched in real applications, so that can then address some of the larger scale benchmarking.

I think something like Tech Empower benchmarks has been hugely influential in making web servers faster.

SchrodingerZhu commented 4 years ago

work in progress at: https://github.com/SchrodingerZhu/bench_suite Haven't found a good example for Agda (though checking IO takes several seconds).

Though a preference of general performance can be useful, I agree that it is unfair for some allocators. (for instance, hardened_malloc performs slower than others on my side; but performance is not its goal; In rust environment, there is also something similar: wee_alloc is developed to reduce the wasm target size, but the performance issue is obvious. After all, having a benchmark has some benefits, it can at least help users to check the performance and provides a reference for them to make their choices).

SchrodingerZhu commented 4 years ago

current output can be found at: https://github.com/SchrodingerZhu/bench_suite/blob/master/output/index.md. Notice that these results are generated on my laptop and hence can be inaccurate.

@mjp41 It seems that snmalloc is having some higher memory peak in some cases.

mjp41 commented 4 years ago

Does the machine you run on have transparent hugh pages?

You can check with

cat /sys/kernel/mm/transparent_hugepage/enabled

If it is set to always, I have found snmalloc can have a very high memory usage as the way we request memory makes the system decide to use hugh pages, and then we may not use much of them.

You can disable it with:

echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled 

Not sure what the correct setting is to have.

SchrodingerZhu commented 4 years ago

rust related simple benchmarks are added. snmalloc and mimalloc performs quite good.