microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License

v2 performance regression in heavy allocation scenario (a lot of syscalls) #447

Open Lnd-stoL opened 3 years ago

Lnd-stoL commented 3 years ago

A simple synthetic allocation/deallocation workload code:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>

int main(int, char*[]) {
  static constexpr int kNumBuffers = 20;
  static constexpr size_t kMinBufferSize = 5 * 1024 * 1024;
  static constexpr size_t kMaxBufferSize = 25 * 1024 * 1024;
  std::unique_ptr<char[]> buffers[kNumBuffers];

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<> size_distribution(kMinBufferSize, kMaxBufferSize);
  std::uniform_int_distribution<> buf_number_distribution(0, kNumBuffers - 1);

  static constexpr int kNumIterations = 1000;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kNumIterations; ++i) {
    int buffer_idx = buf_number_distribution(gen);
    size_t new_size = size_distribution(gen);
    buffers[buffer_idx] = std::make_unique<char[]>(new_size);
  }
  const auto end = std::chrono::steady_clock::now();
  const auto num_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  const auto us_per_allocation = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / kNumIterations;
  std::cout << kNumIterations << " allocations Done in " << num_ms << "ms." << std::endl;
  std::cout << "Avg " << us_per_allocation << " us per allocation" << std::endl;
  return 0;
} 

unfortunately runs significantly slower with mimalloc v2.0.2 on my machine (Ubuntu Linux, x86_64 CPU) than with the default system allocator.

Example with the system default allocator: `1000 allocations Done in 3366ms. Avg 3366 us per allocation`
Example with mimalloc: `1000 allocations Done in 9161ms. Avg 9161 us per allocation`

The issue seems to be caused by lots of mmap/munmap syscalls by mimalloc compared to default allocator:

**perf trace -s bazel-bin/third_party/mimalloc/mimalloc_test**
10000 allocations Done in 488ms.
Avg 48 us per allocation

 Summary of events:
 mimalloc_test (19177), 123062 events, 100.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   munmap             36680   117.243     0.001     0.003     0.052      0.35%
   mmap               19973    44.567     0.002     0.002     0.032      0.23%
   madvise             4724     7.485     0.001     0.002     0.003      0.21%

**perf trace -s bazel-bin/third_party/mimalloc/stdalloc_test**
10000 allocations Done in 23ms.
Avg 2 us per allocation

 Summary of events:
 stdalloc_test (19210), 1400 events, 99.3%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   brk                  506     4.638     0.002     0.009     1.376     29.83%
   openat                46     0.290     0.004     0.006     0.018      7.19%
   mmap                  43     0.204     0.003     0.005     0.009      5.86%
   munmap                27     0.203     0.005     0.008     0.016      5.48%

In such a scenario mimalloc issues, on average, more than one mmap call per malloc call.
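For anyone wanting to reproduce the syscall counts without perf, strace's summary mode gives similar per-syscall totals (the binary path below is illustrative, not from this repo):

```shell
# Summarize allocation-related syscalls made by the test binary.
# -c prints a count/time table at exit; -f follows child processes.
strace -c -f -e trace=mmap,munmap,madvise,brk ./mimalloc_test
```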

The most interesting part is that the issue does not reproduce with stable (v1.7.2) mimalloc, hence the regression.

PS. I realize v2.0.2 is a beta version, but I hope my case will help it become stable one day. Besides this issue, everything else works perfectly for me:)

jasongibson commented 3 years ago

I see up to a 20% performance drop in v2 vs v1 as well, but on Windows (I haven't benchmarked the two on Linux). I can profile it if that's useful.

daanx commented 3 years ago

Ah, very interesting -- thanks for the benchmark. v2 does behave better with regard to fragmentation (especially on large workloads), but it also uses different heuristics to let the OS reuse pages in the (virtual) address space. This is great for reducing (real) fragmentation and working set, but it looks like something is wrong in the heuristics/strategy, and it is great you were able to isolate this. I will get back later this week on this issue.

Lnd-stoL commented 3 years ago

UPD: setting a large enough MIMALLOC_RESERVE_OS_MEMORY fixes the slowdown. I'm not sure how mimalloc actually handles "pre-reserved" memory, but somehow it helps. Any news here?
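For reference, this workaround needs no code changes; it is set via the environment. The 1 GiB value below is just an example -- pick a size that covers the workload's peak footprint:

```shell
# Pre-reserve 1 GiB of OS memory in an arena at startup, so large buffers
# are served from the reserved memory instead of fresh mmap/munmap calls.
MIMALLOC_RESERVE_OS_MEMORY=1g ./mimalloc_test
```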

daanx commented 3 years ago

@Lnd-stoL, I worked on this more over the past days, here are some thoughts:

  1. Just to add perspective: the benchmark is of course atypical, as it allocates large buffers without ever using them. Most (all?) allocators assume that larger allocations can be slower than small ones, since you will presumably do more work with them. (I suspect that adding actual computation over all elements in the buffers would already make the benchmark more equal, as the relative share of time spent in allocation becomes smaller.)

  2. Why is v1.7.x faster than v2.0.x? The benchmark happens to allocate only large buffers (from 5 MiB up to 25 MiB). In v1.7.x, mimalloc manages regions of 256 MiB of virtual memory to avoid expensive OS calls, and the buffers happen to fit in these, so it is fast. In v2.0.x there are no more regions; instead there is a direct cache of segments. Unfortunately, the big buffers do not fit in a segment and are thus allocated/freed directly by the OS every time, which is slower.

What to do?

I will do more testing with upfront reservation and larger segments on some large scale services and see how things go.

jasongibson commented 3 years ago

Thanks for taking a look at this and giving these suggestions.

We're benchmarking them and will post the results next week.

jasongibson commented 3 years ago

Here are some results for the MI_SEGMENT_SLICE_SHIFT option. This is with dev-slice@725fe2ac7d7e767caeeb7eb82af1729fa90cc5c2.

The host process using mimalloc here is a server process that answers client requests, forking on Linux and threading on Windows. Most requests are short lived and the underlying forked process/thread is not reused. Most allocations are small and frequent.

So, 2.0 with the MI_SEGMENT_SLICE_SHIFT tweak is fine speed-wise, and sits in between 1.7 and 2.0 for memory usage.

We have not yet evaluated MIMALLOC_RESERVE_OS_MEMORY, but will do so.

daanx commented 3 years ago

@jasongibson wow, that is great info! I have not yet found a program that I can run that clearly shows the differences in a reproducible way; I need to think a bit about these results but my initial thoughts are:

Thanks!

Lnd-stoL commented 3 years ago

Thanks for the explanation.

> one option is to do nothing as this may never be exhibited by "real" workloads

For what it's worth, we actually faced the issue in a pretty "real" workload) The posted code is just a minimal repro example.

> Another one is to mimic v1.7 by reserving OS memory explicitly

This works for me, but it is only applicable when the needed amount of memory is known beforehand.

> Another option is to increase the segment size

Tuning the segment size does fix the issue, and is actually what I guessed to try) But it also led to more RSS pressure. Not critical, but unpleasant. So

> This can be addressed -- I will actually try to do this and perhaps we can then test again and see if it improves things.

this is great news.

BTW, could you please explain what exactly the "segment cache" is for, and in which cases it is profitable to increase its size?

daanx commented 2 years ago

Hi, I think this issue is now fixed in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.

@Lnd-stoL: I liked your benchmark and added the code as a malloc-large benchmark in the mimalloc-bench repository attributed to you. Let me know if you are ok with this and if not, I will remove it ASAP.

Lnd-stoL commented 2 years ago

Hi!

> Hi, I think this issue is now fixed in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.

Will check it.

> Let me know if you are ok with this and if not, I will remove it ASAP.

This is absolutely ok)

jasongibson commented 2 years ago

I'll report back as well.

jasongibson commented 2 years ago

We finished our testing, and the 2.0.3 changes look good!

Given the lower memory retention in 2.0.3, the overall performance wins on Linux, and general parity on Windows, we've got no reason not to upgrade from the 1.7 line to 2.x, and this ticket is fixed from our perspective. Looking forward to 2.x becoming the official version.

Thank you!

daanx commented 2 years ago

Ah, I missed this earlier, but that is great to hear @jasongibson. Thanks so much for testing.