microsoft / clrmd

Microsoft.Diagnostics.Runtime is a set of APIs for introspecting processes and dumps.
MIT License

Kubernetes pods OOMKilled on DataTarget.CreateSnapshotAndAttach #981

Closed kierenj closed 2 years ago

kierenj commented 2 years ago

When running on Kubernetes, the pod is often (but not always) killed with OOMKilled at the point where I call DataTarget.CreateSnapshotAndAttach on a different .NET process. If I increase the memory limit (fairly significantly), it doesn't occur.

To a certain extent this makes sense: I think that method "duplicates" the process. But it surprised me, because the processes use very little memory (~50-100 MB of allocated memory, as reported by GC.GetTotalMemory()), with limits several times that.
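One thing that may be worth checking: GC.GetTotalMemory() counts only managed-heap allocations, while the kernel's OOM killer acts on all memory charged to the container. The following is a minimal sketch (Python, assuming a Linux container with /proc mounted and the target process visible in the same PID namespace; the helper names are hypothetical) that reads a process's resident set size so it can be compared against the managed-heap figure.

```python
import os
import re


def parse_status_kib(status_text, field):
    """Return the value of a 'Field:   1234 kB' line in KiB, or None if absent."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB$"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def resident_memory_kib(pid="self"):
    """Read VmRSS from /proc/<pid>/status (Linux only); None when unavailable."""
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):
        return None
    with open(path) as status_file:
        return parse_status_kib(status_file.read(), "VmRSS")


if __name__ == "__main__":
    # Pass the target's PID instead of "self" to inspect another process.
    print("VmRSS (KiB):", resident_memory_kib())
```

If the resident size is much larger than the managed-heap number, the gap (native allocations, mapped images, runtime overhead) also counts toward the pod's limit.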

Back in the day, I had issues with .NET Core apps hitting the memory limit because, if I remember correctly, the GC wasn't reading the cgroups info correctly. Could that be the case here? And is there a rule of thumb for setting a Kubernetes memory limit that works with DataTarget.CreateSnapshotAndAttach?
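To see how close the container is to its limit around the snapshot call, the cgroup memory files can be read directly. This is a rough sketch in Python, not a rule of thumb: it assumes the usual cgroup v2 and v1 mount paths inside the container, and the function names are hypothetical.

```python
from pathlib import Path

# Candidate (usage, limit) file pairs: cgroup v2 first, then cgroup v1.
# These are the conventional mount points; a given cluster may differ.
CGROUP_FILES = [
    ("/sys/fs/cgroup/memory.current", "/sys/fs/cgroup/memory.max"),
    ("/sys/fs/cgroup/memory/memory.usage_in_bytes",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"),
]


def parse_bytes(text):
    """Parse a cgroup memory value; cgroup v2 writes 'max' when unlimited."""
    text = text.strip()
    return None if text == "max" else int(text)


def headroom_bytes(usage, limit):
    """Bytes left before the limit, or None when the cgroup is unlimited."""
    return None if limit is None else limit - usage


def read_cgroup_memory():
    """Return (usage, limit) in bytes from the first readable pair, else None."""
    for usage_path, limit_path in CGROUP_FILES:
        try:
            usage = parse_bytes(Path(usage_path).read_text())
            limit = parse_bytes(Path(limit_path).read_text())
        except (OSError, ValueError):
            continue
        return usage, limit
    return None


if __name__ == "__main__":
    reading = read_cgroup_memory()
    if reading is None:
        print("no cgroup memory files found")
    else:
        usage, limit = reading
        print("usage:", usage, "limit:", limit,
              "headroom:", headroom_bytes(usage, limit))
```

Logging these values just before and after the snapshot call would show how much extra memory the operation is charged for in a particular workload.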

leculver commented 2 years ago

Sorry, I don't have extensive knowledge of Kubernetes. In terms of memory usage, my vague understanding is that PssCaptureSnapshot doesn't completely duplicate memory: a page is copied only when it is modified in the target process. So my (possibly mistaken) understanding is that creating a snapshot should use fairly little memory unless the target process is churning through a lot of memory changes at the time.

Note that ClrMD does have some options to reduce its memory usage, but they come with a high performance penalty.

To answer your question, though: unfortunately I don't have any recommendations for using ClrMD under a memory limit. There are a lot of moving parts in ClrMD under the hood that we don't fully control, and keeping it capped to a certain amount of memory (or even knowing how much memory is needed) is not something I know how to quantify. It would depend heavily on what you are doing with the API.

(Closing this issue since I don't think I can help, but I'm happy to re-open or reply further if you have more questions!)