microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License

High virtual memory usage, reset_decommits doesn't work #537

Open Zeblote opened 2 years ago

Zeblote commented 2 years ago

Hi! I'm evaluating using mimalloc-secure 2.0.3 for our game instead of Unreal Engine's default FMallocBinned2. The runtime performance seems good, I have not measured any significant regression from this switch. However, it seems to not decommit memory nearly as much if the memory usage of the game goes down.

My configuration: MI_SECURE = 4

The target for comparison, using FMallocBinned2:

The initial results using mimalloc:

That's... not great in comparison.

I enabled the options page_reset and abandoned_page_reset, which gives significant improvements for the working set, but no improvement on the commit charge: unnecessarily committed memory is still not decommitted.

The option segment_reset does not seem to be hooked up to anything. The option reset_decommits causes the game to crash instantly during static initialization if page_reset is also enabled (otherwise it does nothing): [crash screenshots attached]

daanx commented 2 years ago

Hi @Zeblote; thanks for your report -- I would like to see if we can improve this. It would be great if you could supply a test that I can run myself, but otherwise we'll try here :-).

It looks like perhaps during a save and then reload there are many terminated threads that "abandon" their heaps. That memory is usually recycled slowly, on demand, and that is, I think (maybe?), why you see the high commit charge after a clear. (Usually this is good -- we do not want to decommit memory that we are going to use again soon after.) Enabling abandoned_page_reset helps here, so keep that on.

Can you also try:

  1. Set MIMALLOC_SEGMENT_DECOMMIT_DELAY=0; this will aggressively decommit segments of abandoned heaps.
  2. If this does not help, if possible, call mi_collect(true) on threads that are about to be terminated (or right after a clear). I understand this may be hard depending on the source code. Let me know.

Best, Daan

PS. Using secure mode is more expensive; it is amazing if you see similar runtimes in that case compared to the other allocator. Have you tried whether you see the same commit charge in the plain version (without MI_SECURE)? I think it should be the same, but it may be good to verify this.

Zeblote commented 2 years ago

Thanks for the tips, I will try them all in just a bit! I didn't know about the dev-slice branch; do you recommend it only for testing, or also for production use?

You mention calling mi_collect() on specific threads; does that mean it doesn't have a global effect? I had already hooked it up to the TrimMemory hook that Unreal Engine calls roughly once a minute on the main thread, after its garbage collector has dumped unused objects and assets. Maybe that's not how it's intended to be used, though.

Note that on windows a higher commit charge does not mean it uses that memory

Fun story there - it can! Particularly if your binaries do not yet have a good reputation with Windows Defender (i.e. are not digitally signed). I was recently investigating a highly mysterious issue where all of the committed but unused memory was suddenly getting moved into the working set for literally no reason. So I'd load the save, it'd use 6GB of RAM, I'd leave it idle for a while, and suddenly, out of nowhere, it would jump up to 11GB. It doesn't happen in the previous release build that's been launched by many thousands of users, but it happens in every new build I create.

After many, many hours of puzzling, I was finally able to track it down using Windows Performance Recorder/Analyzer: MsMpEng.exe uses ReadProcessMemory to scan all committed memory of my process, which causes that memory to be backed by physical pages so it can be read. Still stumped on this one, and I really hope it won't happen for users once the build is deployed. Two users I gave the build to for testing were also able to reproduce it, but that's probably not enough to affect Windows Defender reputation. Since you work on this memory-related stuff, is there a chance you have the right connections to get this looked at..?

I can provide screenshots, .etl traces capturing the problem, etc. -- whatever is necessary to demonstrate that Windows Defender is misbehaving.

Currently, I have added the dev workspace to the Windows Defender exclusions so I can run sensible tests with these allocators without it interfering like this. It is 100% reproducible on my machine: the dev build works fine until Windows Defender touches it and doubles the memory usage for no reason.

daanx commented 2 years ago

Thanks. Yikes -- Windows Defender touching all committed memory would not be good for the performance of large services and applications. I have no connections to that team but will ask around a bit (though it's best if you are able to file a report through other channels). It may be by design though...

The dev branch is the latest 1.7.x version, and dev-slice is the latest 2.0.x version -- these are development branches and not meant for production use, though they are usually quite stable. I actually hope to bump to v2.0.4 this week. The 2.x versions have another internal way of allocating mimalloc "pages" and were made to be better at decommitting and reusing memory for long-running services and large memory footprints. (That is why I really want to address your issue, as these results are unusual.)

I checked again, and my earlier suggestion:

If this does not help, if possible, call mi_collect(true) on threads that are about to be terminated (or right after a clear). I understand this may be hard depending on the source code. Let me know.

will not help; the collect is already done automatically on thread termination. I hope you didn't already spend time on this...

mi_collect is indeed not global but per-thread. This is actually OK, I think, but we may want to extend it to also decommit the abandoned heaps of terminated threads, as my hunch is that this is what we are seeing in your situation. Not sure. Let's first see what happens when segment_decommit_delay is set to 0.

Zeblote commented 2 years ago

@daanx Thanks for the response! I did not get to experiment more with it yet, but will soon.

I don't think there are many terminated threads, by the way. Unreal Engine creates like 50 threads at the start and then runs a task graph on them, so they are always reused until the game is closed.

(but best if you are able to file a report through other channels)

Unfortunately, not being part of a company with an enterprise support contract, we don't have any such channel beyond standard Microsoft support, which wasn't able to understand the problem and gave me unrelated copy-paste answers.

It may be by design though..

I can't imagine that. I don't think I've ever seen it happen in the past, and surely Windows Defender could have some way to determine whether memory regions actually have physical pages allocated, or... not do that? It is kind of a disaster. Especially since there is nothing we can do about this virtual memory usage: it is committed by the d3d11 library as part of allocating GPU buffers and then abandoned. And once the scan has marked it as used, it just sits around in the working set, since the program is not aware of this happening and can't tell Windows that it doesn't actually need this memory...

If it really is by design, does some way exist to have release builds start out with a positive reputation? Would signing the binaries with an EV certificate help? I noticed that the issue does not happen to any other signed software, or to our old build that's been around for a while.

Zeblote commented 2 years ago

but will ask around a bit

That would be awesome! Here are some images from one of the traces I captured that demonstrate the problem.

General overview showing that at some point, the working set magically rises to cover almost all of the commit while the game is idle: [screenshot attached]

[additional screenshot attached]

Zooming in on the problematic section and filtering down a bit, demonstrating Windows Defender causing millions of page accesses during that exact time: [screenshot attached]

That count multiplied by 4KB pages equals roughly the amount of memory that was unnecessarily added to the working set.

Zeblote commented 2 years ago

Of course, if this really is intentional behavior, the VirtualAlloc documentation must be updated. [screenshot of the documentation attached]

But I really, really doubt it is. It seems more like a bug in Windows Defender?

Zeblote commented 2 years ago

By the way, the pages mimalloc resets with this process

The page_reset on Windows will do MEM_RESET, which will make the memory available to other processes, but it will not show up as decommitted. (Note that on Windows a higher commit charge does not mean it uses that memory -- it is just backed by the OS and most of it may be virtual, due to MEM_RESET for example. The working set should be as low as possible, though.)

are also brought back to life by the Windows Defender problem.

daanx commented 2 years ago

Hi @Zeblote; thanks for the detailed report -- if I can find the relevant people I will forward this information.

Zeblote commented 2 years ago

@daanx I finally got to try the dev-slice branch and that new option. Here are the results:

FMallocBinned2 (start, load, clear, load) perfmon_2022-02-04_04-38-21 perfmon_2022-02-04_04-39-10 perfmon_2022-02-04_04-39-35 perfmon_2022-02-04_04-40-23

mimalloc MI_SECURE=4, page_reset=1, abandoned_page_decommit=1, segment_decommit_delay=0 (start, load, clear, load) perfmon_2022-02-04_04-42-17 perfmon_2022-02-04_04-43-02 perfmon_2022-02-04_04-43-31 perfmon_2022-02-04_04-44-11

It doesn't seem like much changed?

I think these are acceptable results though. Not particularly concerned with the higher working set after clearing the save since it seems to be correctly reused on loading the next one.

The deallocations during clearing the save with this config seem to be significantly slower than FMallocBinned2 or mimalloc with page_reset disabled. But disabling that makes it never free any memory at all, which is unusable.

Zeblote commented 2 years ago

Perhaps interesting: also setting decommit_delay=0 makes clearing the save even slower, but it reduces the overall commit and brings us closer to the FMallocBinned2 results in that column. Are delayed decommits working correctly? Does mimalloc need to be "ticked" in some way?

mimalloc MI_SECURE=4, page_reset=1, abandoned_page_decommit=1, decommit_delay=0, segment_decommit_delay=0 (start, load, clear, load) perfmon_2022-02-04_05-02-52 perfmon_2022-02-04_05-03-58 perfmon_2022-02-04_05-04-46 perfmon_2022-02-04_05-05-25

Zeblote commented 2 years ago

I think FMallocBinned2 might be a good benchmark to compare against in general, at least in Unreal Engine packaged builds it seems to beat every other allocator implementation in both allocation/deallocation performance and total memory usage at the same time. But of course it has no security features at all, hence this experiment.

daanx commented 2 years ago

Thanks for your detailed reports @Zeblote! Great. I hope we can make it work better out of the box without needing these aggressive settings.

I made changes to mi_collect(true), and I think it might help to call that when clearing the save (from just the main thread). Wait a bit until I have done more experiments... but then it would be great if you could test it again :-)

daanx commented 2 years ago

I pushed a new release today that has more aggressive settings for mi_collect -- it would be great if you could try again with v2.0.5 and see if things improved. Thanks!

Zeblote commented 2 years ago

Hmm. Here are a few tests with 2.0.5, mi_collect(true) has been called on the main thread at each step before capturing the memory usage. @daanx

FMallocBinned2 (goal: every value is lower than the lowest achieved with mimalloc) perfmon_2022-02-16_01-41-42 perfmon_2022-02-16_01-42-38 perfmon_2022-02-16_01-43-19 perfmon_2022-02-16_01-44-27

MI_SECURE=4 perfmon_2022-02-16_01-03-17 perfmon_2022-02-16_01-04-11 perfmon_2022-02-16_01-04-51 perfmon_2022-02-16_01-06-01

MI_SECURE=4 segment_decommit_delay=0 (seems about same as previous?) perfmon_2022-02-16_01-26-50 perfmon_2022-02-16_01-28-01 perfmon_2022-02-16_01-28-35 perfmon_2022-02-16_01-29-28

MI_SECURE=4 page_reset=1 (slower to clear save, much lower working set, no effect on commit) perfmon_2022-02-16_01-08-37 perfmon_2022-02-16_01-09-39 perfmon_2022-02-16_01-10-07 perfmon_2022-02-16_01-11-08

MI_SECURE=4 page_reset=1 segment_decommit_delay=0 (seems about same as previous?) perfmon_2022-02-16_01-20-19 perfmon_2022-02-16_01-21-19 perfmon_2022-02-16_01-22-15 perfmon_2022-02-16_01-23-18

MI_SECURE=4 decommit_delay=0 (much slower to clear save, much lower working set and commit) perfmon_2022-02-16_01-13-16 perfmon_2022-02-16_01-14-37 perfmon_2022-02-16_01-15-33 perfmon_2022-02-16_01-17-03

MI_SECURE=4 page_reset=1 decommit_delay=0 (very much slower to clear save, seems kinda same as previous?) perfmon_2022-02-16_01-36-26 perfmon_2022-02-16_01-37-28 perfmon_2022-02-16_01-38-06 perfmon_2022-02-16_01-39-04

Zeblote commented 2 years ago

I'm still confused why decommit_delay=0 has such a massive effect. Shouldn't its only change be that memory is decommitted sooner? Or maybe it's badly named for what it does?