microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License

v2 performance regression in heavy allocation scenario (a lot of syscalls) #447

Open Lnd-stoL opened 3 years ago

Lnd-stoL commented 3 years ago

A simple synthetic allocation/deallocation workload code:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>

int main(int, char*[]) {
  static constexpr int kNumBuffers = 20;
  static constexpr size_t kMinBufferSize = 5 * 1024 * 1024;
  static constexpr size_t kMaxBufferSize = 25 * 1024 * 1024;
  std::unique_ptr<char[]> buffers[kNumBuffers];

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<> size_distribution(kMinBufferSize, kMaxBufferSize);
  std::uniform_int_distribution<> buf_number_distribution(0, kNumBuffers - 1);

  static constexpr int kNumIterations = 1000;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kNumIterations; ++i) {
    int buffer_idx = buf_number_distribution(gen);
    size_t new_size = size_distribution(gen);
    buffers[buffer_idx] = std::make_unique<char[]>(new_size);
  }
  const auto end = std::chrono::steady_clock::now();
  const auto num_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  const auto us_per_allocation = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / kNumIterations;
  std::cout << kNumIterations << " allocations Done in " << num_ms << "ms." << std::endl;
  std::cout << "Avg " << us_per_allocation << " us per allocation" << std::endl;
  return 0;
} 

unfortunately runs significantly slower with mimalloc v2.0.2 on my machine (Ubuntu Linux, x86_64 CPU) than with the default system allocator.

Example with the system default allocator: `1000 allocations Done in 3366ms. Avg 3366 us per allocation`
Example with mimalloc: `1000 allocations Done in 9161ms. Avg 9161 us per allocation`

The issue seems to be caused by lots of mmap/munmap syscalls by mimalloc compared to default allocator:

**perf trace -s bazel-bin/third_party/mimalloc/mimalloc_test**
10000 allocations Done in 488ms.
Avg 48 us per allocation

 Summary of events:
 mimalloc_test (19177), 123062 events, 100.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   munmap             36680   117.243     0.001     0.003     0.052      0.35%
   mmap               19973    44.567     0.002     0.002     0.032      0.23%
   madvise             4724     7.485     0.001     0.002     0.003      0.21%

**perf trace -s bazel-bin/third_party/mimalloc/stdalloc_test**
10000 allocations Done in 23ms.
Avg 2 us per allocation

 Summary of events:
 stdalloc_test (19210), 1400 events, 99.3%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   brk                  506     4.638     0.002     0.009     1.376     29.83%
   openat                46     0.290     0.004     0.006     0.018      7.19%
   mmap                  43     0.204     0.003     0.005     0.009      5.86%
   munmap                27     0.203     0.005     0.008     0.016      5.48%

In such a scenario mimalloc issues, on average, more than one mmap call per malloc call.
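For anyone wanting to reproduce the syscall counts without perf, strace's summary mode gives similar per-syscall totals (the binary path below is illustrative, not from this repo):

```shell
# Summarize allocation-related syscalls made by the test binary.
# -c prints a count/time table at exit; -f follows child processes.
strace -c -f -e trace=mmap,munmap,madvise,brk ./mimalloc_test
```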

The most interesting part is that the issue does not reproduce with stable (v1.7.2) mimalloc, hence the regression.

PS. I realize v2.0.2 is a beta version, but I hope my case will help it become stable one day. Besides this issue, everything else works perfectly for me:)

jasongibson commented 3 years ago

I see up to a 20% performance drop in v2 vs v1 as well, but on Windows (I haven't benchmarked the two on Linux). I can profile it if that's useful.

daanx commented 3 years ago

Ah, very interesting -- thanks for the benchmark. v2 does behave better with regard to fragmentation (especially on large workloads), but it also uses different heuristics to let the OS reuse pages in the (virtual) address space. This is great for reducing (real) fragmentation and working set, but it looks like something is wrong in the heuristics/strategy, and it is great you were able to isolate this. I will get back later this week on this issue.

Lnd-stoL commented 3 years ago

UPD: setting a large enough MIMALLOC_RESERVE_OS_MEMORY fixes the slowdown. I'm not sure how mimalloc actually handles "pre-reserved" memory, but somehow it helps. Any news here?
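For reference, this workaround needs no code changes; it is set via the environment. The 1 GiB value below is just an example -- pick a size that covers the workload's peak footprint:

```shell
# Pre-reserve 1 GiB of OS memory in an arena at startup, so large buffers
# are served from the reserved memory instead of fresh mmap/munmap calls.
MIMALLOC_RESERVE_OS_MEMORY=1g ./mimalloc_test
```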

daanx commented 3 years ago

@Lnd-stoL, I worked on this more over the past days, here are some thoughts:

  1. Just to add perspective: the benchmark is of course atypical, as it allocates large buffers without ever using them. Most (all?) allocators assume that larger allocations can be slower than small ones, since you will presumably do more work with them. (I suspect that adding actual computation over all elements in the buffers would already make the benchmark more equal, as the relative share of time spent in allocation becomes smaller.)

  2. Why is v1.7.x faster than v2.0.x? The benchmark happens to allocate only large buffers (from 5 MiB up to 25 MiB). In v1.7.x, mimalloc manages regions of 256 MiB of virtual memory to avoid expensive OS calls, and the buffers happen to fit in these, so it is fast. In v2.0.x there are no more regions; instead there is a direct cache of segments. Unfortunately, the big buffers do not fit in a segment and are thus allocated/freed directly by the OS every time, which is slower.

What to do?

I will do more testing with upfront reservation and larger segments on some large scale services and see how things go.

jasongibson commented 3 years ago

Thanks for taking a look at this and giving these suggestions.

We're benchmarking them and will post the results next week.

jasongibson commented 3 years ago

Here are some results for the MI_SEGMENT_SLICE_SHIFT option. This is with dev-slice@725fe2ac7d7e767caeeb7eb82af1729fa90cc5c2.

The host process using mimalloc here is a server process that answers client requests, forking on Linux and threading on Windows. Most requests are short lived and the underlying forked process/thread is not reused. Most allocations are small and frequent.

So, 2.0 with the MI_SEGMENT_SLICE_SHIFT tweak is fine speed-wise, and sits in between 1.7 and 2.0 for memory usage.

We have not yet evaluated MIMALLOC_RESERVE_OS_MEMORY, but will do so.

daanx commented 3 years ago

@jasongibson wow, that is great info! I have not yet found a program that I can run that clearly shows the differences in a reproducible way; I need to think a bit about these results but my initial thoughts are:

Thanks!

Lnd-stoL commented 3 years ago

Thanks for the explanation.

> one option is to do nothing as this may never be exhibited by "real" workloads

For what it's worth, we actually faced the issue in a pretty "real" workload) The posted code is just a minimal repro example.

> Another one is to mimic v1.7 by reserving OS memory explicitly

This works for me, but it is only applicable when the needed amount of memory is known beforehand.

> Another option is to increase the segment size

Tuning the segment size does fix the issue, and is actually what I guessed to try) But it also led to more RSS pressure. Not critical, but unpleasant. So

> This can be addressed -- I will actually try to do this and perhaps we can then test again and see if it improves things.

this is great news.

BTW, could you please explain what exactly the "segment cache" is for, and in which cases it is profitable to increase its size?

daanx commented 2 years ago

Hi, I think this issue is now fixed in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.

@Lnd-stoL: I liked your benchmark and added the code as a malloc-large benchmark in the mimalloc-bench repository attributed to you. Let me know if you are ok with this and if not, I will remove it ASAP.

Lnd-stoL commented 2 years ago

Hi!

> Hi, I think this issue is now fixed in version v2.0.3 -- I hope this is also measurable in real-world workloads; let me know how it goes.

Will check it.

> Let me know if you are ok with this and if not, I will remove it ASAP.

This is absolutely ok)

jasongibson commented 2 years ago

I'll report back as well.

jasongibson commented 2 years ago

We finished our testing, and the 2.0.3 changes look good!

Given the lower memory retention in 2.0.3, the overall performance wins on Linux, and general parity on Windows, we've got no reason not to upgrade from the 1.7 line to 2.x, and this ticket is fixed from our perspective. Looking forward to 2.x becoming the official version.

Thank you!

daanx commented 2 years ago

Ah, I missed this earlier, but that is great to hear @jasongibson. Thanks so much for testing.