Lnd-stoL opened this issue 3 years ago
I see up to a 20% performance drop in v2 vs v1 as well, but on Windows (I haven't benchmarked the two on Linux). I can profile it if that's useful.
Ah, very interesting -- thanks for the benchmark. v2 does behave better with regard to fragmentation (especially on large workloads), but it also uses different heuristics to allow the OS to reuse (virtual memory) pages in the (virtual) address space. This is great for reducing (real) fragmentation and working set, but it looks like there is something wrong in the heuristics/strategy, and it is great you were able to isolate this. I will get back later this week on this issue.
UPD: setting a large enough MIMALLOC_RESERVE_OS_MEMORY fixes the slowdown. I'm not sure how mimalloc actually handles the pre-reserved memory, but somehow it helps. Any news here?
@Lnd-stoL, I worked on this more over the past days, here are some thoughts:
Just to add perspective: the benchmark is of course atypical, as it allocates large buffers without using them. Most (all?) allocators assume that larger allocations can be slower than small ones, since you will presumably do more work with them. (I suspect that adding actual computations over all elements of the buffers would already make the benchmark more equal, as the relative share of time spent in allocation shrinks.)
Why is v1.7.x faster than v2.0.x? The benchmark happens to only allocate large buffers (> 5MiB, up to 25MiB). In v1.7.x there are regions of (virtual) 256MiB that mimalloc manages in order to avoid expensive OS calls; the buffers happen to fit in these, and thus it is fast. In v2.0.x there are no more regions, but instead a direct cache of segments -- unfortunately, the big buffers do not fit in a segment and thus get allocated/freed directly by the OS every time, which is slower.
What to do?

- One option is to do nothing, as this may never be exhibited by "real" workloads.
- Another one is to mimic v1.7 by reserving OS memory explicitly: set `MIMALLOC_RESERVE_OS_MEMORY=512m`. This allocates a big arena of (virtual) 512MiB that is just like a v1.7 region in that mimalloc manages that area without needing expensive OS calls and can allocate segments and large buffers from it directly. This is perhaps the best option -- generally, for any application that is known to use at least some amount N of memory, it is best to call `mi_reserve_os_memory(N, true /*commit*/, true /*allow large pages*/)` at the start of the program (see the sketch after this list).
- Another option is to increase the segment size so that the big buffers fit in a segment: edit `mimalloc-types.h` and change the `MI_SEGMENT_SHIFT` from `7 +` to `10 +` (also sketched below). (@jasongibson: if you can repro your perf degradation, it would be interesting to try this as well and let me know if it made a difference.) The drawback of this is that it increases virtual memory usage per thread, as each thread gets at least one segment, so this would not be ideal on 32-bit systems.

I will do more testing with upfront reservation and larger segments on some large-scale services and see how things go.
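To make the segment-size option concrete, this is roughly what the edit looks like (a sketch against the v2.0.x `mimalloc-types.h` layout; exact values and surrounding definitions may differ between versions):

```c
/* mimalloc-types.h (v2.0.x), sketch of the suggested change: */
/* #define MI_SEGMENT_SHIFT  ( 7 + MI_SEGMENT_SLICE_SHIFT) */  /* default: ~8MiB segments on 64-bit */
#define MI_SEGMENT_SHIFT     (10 + MI_SEGMENT_SLICE_SHIFT)     /* enlarged: ~64MiB segments on 64-bit */
```

And a sketch of the upfront reservation, using the public `mi_reserve_os_memory` API from `mimalloc.h` (the 512 MiB figure is just the value used in this thread; pick N for your workload):

```c
#include <mimalloc.h>
#include <stdio.h>

int main(void) {
  /* Reserve a 512 MiB arena up front so mimalloc can carve segments and
     large buffers out of it instead of issuing mmap/munmap calls per
     large allocation. Returns 0 on success. */
  if (mi_reserve_os_memory(512 * 1024 * 1024,
                           true /* commit */, true /* allow large pages */) != 0) {
    fprintf(stderr, "reservation failed; mimalloc falls back to on-demand OS allocation\n");
  }
  /* ... the rest of the program allocates as usual ... */
  return 0;
}
```

Setting the environment variable `MIMALLOC_RESERVE_OS_MEMORY=512m` achieves the same effect without a code change.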
Thanks for taking a look at this and giving these suggestions.
We're benchmarking them and will post the results next week.
Here are some results for the MI_SEGMENT_SLICE_SHIFT option. This is with dev-slice@725fe2ac7d7e767caeeb7eb82af1729fa90cc5c2.
The host process using mimalloc here is a server process that answers client requests, forking on Linux and threading on Windows. Most requests are short lived and the underlying forked process/thread is not reused. Most allocations are small and frequent.
So, 2.0+MI_SEGMENT_SLICE_SHIFT is fine speed-wise, and in-between 1.7/2.0 for memory usage.
We have not yet evaluated MIMALLOC_RESERVE_OS_MEMORY, but will do so.
@jasongibson wow, that is great info! I have not yet found a program that I can run that clearly shows the differences in a reproducible way; I need to think a bit about these results but my initial thoughts are:
- Great to see that larger segments made a difference -- before this, I was not sure what a possible cause might be.
- I see the memory usage is also in-between 1.7 and plain 2.0; I think that is probably caused by the coarser decommit granularity of the larger segments. This can be addressed -- I will actually try to do this and perhaps we can then test again and see if it improves things.
- Do you know if there are many allocations of blocks larger than ~2MiB? The speedup might be due to the larger blocks fitting in the segments (as in the test program that started this thread), but perhaps we see a speedup because there is more chance that a segment is in the cache (and we should just enlarge the segment cache size).
- mimalloc tries to give memory back to the OS by "resetting" it; on Windows this actually decommits by default, which is why you can see the RSS going down. (On Linux mimalloc uses other mechanisms which can be "lazy" (like MADV_FREE), meaning it is not so easy to measure how much is truly in use.) I cannot readily explain, though, why the larger segment shift causes it not to drop to zero as well; I need to think more about this.
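As an aside on the MADV_FREE point above, here is a minimal standalone illustration (for exposition only, not mimalloc's actual code path) of why RSS is hard to interpret under lazy reclamation:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 16u * 1024 * 1024;
  /* Map and touch 16 MiB so the pages count toward RSS. */
  char* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return 1;
  memset(p, 1, len);
  /* MADV_FREE only marks the pages as reclaimable; the kernel reclaims
     them lazily, under memory pressure. Until then RSS stays high even
     though the memory has effectively been given back, so RSS overstates
     what a MADV_FREE-based allocator really uses. */
  madvise(p, len, MADV_FREE);
  getchar(); /* pause here and inspect RSS, e.g. in /proc/self/status */
  munmap(p, len);
  return 0;
}
```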
Thanks!
Thanks for the explanation.
> One option is to do nothing, as this may never be exhibited by "real" workloads.
Just to say it: we actually faced the issue in a pretty "real" workload :) The posted code is just a minimal repro example.
> Another one is to mimic v1.7 by reserving OS memory explicitly
This works for me, but it is only applicable when the needed amount of memory is known beforehand.
> Another option is to increase the segment size
Tuning the segment size really does fix the issue, and is actually what I guessed to do :) But it also leads to more RSS pressure. Not critical, but unpleasant. So
> This can be addressed -- I will actually try to do this and perhaps we can then test again and see if it improves things.
this is great news.
BTW, could you please explain what exactly the "segment cache" is for, and in which cases it is profitable to increase its size?
Hi, I think this issue is fixed now in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.
@Lnd-stoL: I liked your benchmark and added the code as a `malloc-large` benchmark in the mimalloc-bench repository, attributed to you. Let me know if you are ok with this; if not, I will remove it ASAP.
Hi!
> Hi, I think this issue is fixed now in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.
Will check it.
> Let me know if you are ok with this; if not, I will remove it ASAP.
This is absolutely ok :)
I'll report back as well.
We finished our testing, and the 2.0.3 changes look good!
Given the lower memory retention in 2.0.3, overall performance wins on Linux, and general parity on Windows, we've got no reason not to upgrade from the 1.7 line to 2.x, and this ticket is fixed from our perspective. Lookin' forward to 2.x being the official version.
Thank you!
Ah, I missed this earlier, but that is great to hear @jasongibson. Thanks so much for testing.
A simple synthetic allocation/deallocation workload unfortunately runs significantly slower with mimalloc v2.0.2 on my machine (Ubuntu Linux, x86_64 CPU) than with the default system allocator.
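The original snippet is not included in this excerpt; a minimal sketch of the pattern described in this thread (5..25 MiB buffers allocated and freed without touching their contents; the loop count, size distribution, and output format are assumptions) might look like:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
  enum { ITERS = 1000 };
  struct timespec t0, t1;
  srand(42); /* fixed seed for repeatability */
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++) {
    /* 5..25 MiB, the size range mentioned earlier in the thread */
    size_t size = (size_t)(5 + rand() % 21) * 1024 * 1024;
    void* p = malloc(size);
    if (p == NULL) return 1;
    free(p); /* never written to: measures pure alloc/free cost */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  long long ms = (t1.tv_sec - t0.tv_sec) * 1000LL
               + (t1.tv_nsec - t0.tv_nsec) / 1000000LL;
  printf("%d allocations Done in %lldms. Avg %lld us per allocation\n",
         ITERS, ms, ms * 1000 / ITERS);
  return 0;
}
```

Running it as-is exercises the system allocator; on Linux, `LD_PRELOAD=/usr/local/lib/libmimalloc.so ./a.out` (the library path is an assumption) routes malloc/free through mimalloc without recompiling.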
Example with the system default allocator:
1000 allocations Done in 3366ms. Avg 3366 us per allocation

Example with mimalloc:
1000 allocations Done in 9161ms. Avg 9161 us per allocation
The issue seems to be caused by the large number of mmap/munmap syscalls mimalloc issues compared to the default allocator: in this scenario mimalloc calls mmap even more than once for each malloc call.
The most interesting part is that the issue does not reproduce with the stable release (v1.7.2) of mimalloc, hence this looks like a regression.
PS. I realize v2.0.2 is a beta version, but I hope my case will help it become stable one day. Besides this issue, everything else works perfectly for me :)