golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime/debug: soft memory limit #48409

Closed mknyszek closed 2 years ago

mknyszek commented 3 years ago

Proposal: Soft memory limit

Author: Michael Knyszek

Summary

I propose a new option for tuning the behavior of the Go garbage collector by setting a soft memory limit on the total amount of memory that Go uses.

This option comes in two flavors: a new runtime/debug function called SetMemoryLimit and a GOMEMLIMIT environment variable. In sum, the runtime will try to maintain this memory limit by limiting the size of the heap and by returning memory to the underlying platform more aggressively. This includes a mechanism to help mitigate garbage collection death spirals. Finally, by setting GOGC=off, the Go runtime will always grow the heap to the full memory limit.
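For illustration, here is a minimal sketch of how the proposed API would be used, assuming the SetMemoryLimit signature described in the design document (GOGC is toggled via the existing runtime/debug.SetGCPercent). The equivalent environment configuration would be GOMEMLIMIT=8GiB GOGC=off.

```go
package main

import "runtime/debug"

func main() {
	// Ask the runtime to keep total memory use at roughly 8 GiB.
	// The limit is soft: the runtime may exceed it rather than thrash.
	debug.SetMemoryLimit(8 << 30)

	// Optionally disable GOGC-based pacing entirely so the heap is
	// allowed to grow all the way to the memory limit.
	debug.SetGCPercent(-1) // equivalent to GOGC=off

	// ... application code ...
}
```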

This new option gives applications better control over their resource economy. It empowers users to:

Details

Full design document found here.

Note that, for the time being, this proposal intends to supersede #44309. Frankly, I haven't been able to find a significant use-case for it, as opposed to a soft memory limit overall. If you believe you have a real-world use-case for a memory target where a memory limit with GOGC=off would not solve the same problem, please do not hesitate to post on that issue, contact me on the gophers slack, or via email at mknyszek@golang.org. Please include as much detail as you can.

gopherbot commented 3 years ago

Change https://golang.org/cl/350116 mentions this issue: design: add proposal for a soft memory limit

mpx commented 3 years ago

Afaict, the impact of memory limit is visible once the GC is CPU throttled, but not before. Would it be worth exposing the current effective GOGC as well?

mknyszek commented 3 years ago

@mpx I think that's an interesting idea. If GOGC is not off, then you have a very clear sign of throttling in telemetry. However, if GOGC=off I think it's harder to tell, and it gets blurry once the runtime starts bumping up against the GC CPU utilization limit, i.e. what does effective GOGC mean when the runtime is letting itself exceed the heap goal?

I think that's pretty close. Ideally we would have just one metric that could show, at-a-glance, "are you in the red, and if so, how far?"
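(For readers looking for such a signal today, here is a minimal sketch that computes the GC's share of CPU time from runtime/metrics. The /cpu/classes/ metrics it reads were added in a later Go release and are estimates, so this is illustrative rather than part of this proposal.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// gcCPUFraction reports the cumulative fraction of the program's CPU time
// spent in the garbage collector, as estimated by the runtime.
func gcCPUFraction() float64 {
	samples := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(samples)
	gc := samples[0].Value.Float64()
	total := samples[1].Value.Float64()
	if total == 0 {
		return 0
	}
	return gc / total
}

func main() {
	fmt.Printf("GC CPU fraction: %.2f\n", gcCPUFraction())
}
```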

raulk commented 3 years ago

In case you find this useful as a reference (and possibly to include in "prior art"), the go-watchdog library schedules GC according to a user-defined policy. It can infer limits from the environment/host, container, and it can target a maximum heap size defined by the user. I built this library to deal with https://github.com/golang/go/issues/42805, and ever since we integrated it into https://github.com/filecoin-project/lotus, we haven't had a single OOM reported.

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 2 years ago

@mknyszek what is the status of this?

mknyszek commented 2 years ago

@rsc I believe the design is complete. I've received feedback on the design, iterated on it, and I've arrived at a point where there aren't any major remaining comments that need to be addressed. I think the big question at the center of this proposal is whether the API benefit is worth the cost. The implementation can change and improve over time; most of the details are internal.

Personally, I think the answer is yes. I've found that mechanisms that respect users' memory limits and give the GC the flexibility to use more of the available memory are quite popular. Where Go users implement this themselves, they're left working with tools (like runtime.GC/debug.FreeOSMemory and heap ballasts) that have some significant pitfalls. The proposal also takes steps to mitigate the most significant costs of having a new GC tuning knob.
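(For readers unfamiliar with the heap-ballast workaround mentioned above, a minimal sketch is below: a large, never-written allocation inflates the heap size that GOGC pacing is based on, so collections run less often. It is exactly the kind of fragile trick a memory limit is meant to replace.)

```go
package main

import "runtime"

func main() {
	// Allocate a 2 GiB ballast. The pages are never written, so most
	// operating systems will not back them with physical memory, but the
	// GC pacer still counts the allocation toward the heap size.
	ballast := make([]byte, 2<<30)

	// ... run the application ...

	// Keep the ballast reachable for the lifetime of the program.
	runtime.KeepAlive(ballast)
}
```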

In terms of implementation, I have some of the foundational bits up for review now that I wish to land in 1.18 (I think they're uncontroversial improvements, mostly related to the scavenger). My next step is to create a complete implementation and trial it on real workloads. I suspect that a complete implementation won't land in 1.18 at this point, which is fine. It'll give me time to work out any unexpected issues with the design in practice.

rsc commented 2 years ago

Thanks for the summary. Overall the reaction here seems overwhelmingly positive.

Does anyone object to doing this?

kent-h commented 2 years ago

I have some of the foundational bits up for review now that I wish to land in 1.18

I suspect that a complete implementation won't land in 1.18

@mknyszek I'm somewhat confused by this. At a high level, what are you hoping to include in 1.18, and what do you expect to come later? (Specifically: will we have extra knobs in 1.18, or will these changes be entirely internal?)

mknyszek commented 2 years ago

@Kent-H The proposal has not been accepted, so the API will definitely not land in 1.18. All that I'm planning to land is work on the scavenger, to make it scale a bit better. This is useful in its own right, and it happens that the implementation of SetMemoryLimit as described in the proposal depends on it. There won't be any internal functionality pertaining to SetMemoryLimit in the tree in Go 1.18.

rsc commented 2 years ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 2 years ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

gopherbot commented 2 years ago

Change https://go.dev/cl/393401 mentions this issue: runtime: add a non-functional memory limit to the pacer

gopherbot commented 2 years ago

Change https://go.dev/cl/353989 mentions this issue: runtime: add GC CPU utilization limiter

gopherbot commented 2 years ago

Change https://go.dev/cl/393400 mentions this issue: runtime: add byte count parser for GOMEMLIMIT

gopherbot commented 2 years ago

Change https://go.dev/cl/394220 mentions this issue: runtime: maintain a direct count of total allocs and frees

gopherbot commented 2 years ago

Change https://go.dev/cl/394221 mentions this issue: runtime: set the heap goal from the memory limit

gopherbot commented 2 years ago

Change https://go.dev/cl/393402 mentions this issue: runtime: track how much memory is mapped in the Ready state

gopherbot commented 2 years ago

Change https://go.dev/cl/397018 mentions this issue: runtime/debug: export SetMemoryLimit

gopherbot commented 2 years ago

Change https://go.dev/cl/397015 mentions this issue: runtime: remove float64 multiplication in heap trigger compute path

gopherbot commented 2 years ago

Change https://go.dev/cl/397016 mentions this issue: runtime: create async work queue to handle runtime triggers

gopherbot commented 2 years ago

Change https://go.dev/cl/397017 mentions this issue: runtime: make the scavenger and allocator respect the memory limit

gopherbot commented 2 years ago

Change https://go.dev/cl/397014 mentions this issue: runtime: check the heap goal and trigger dynamically

gopherbot commented 2 years ago

Change https://go.dev/cl/397679 mentions this issue: runtime: update inconsistent gcController stats more carefully

gopherbot commented 2 years ago

Change https://go.dev/cl/397678 mentions this issue: runtime: move inconsistent memstats into gcController

gopherbot commented 2 years ago

Change https://go.dev/cl/397677 mentions this issue: runtime: clean up inconsistent heap stats

gopherbot commented 2 years ago

Change https://go.dev/cl/399014 mentions this issue: runtime: replace PI controller in pacer with simpler heuristic

gopherbot commented 2 years ago

Change https://go.dev/cl/398834 mentions this issue: runtime: rewrite pacer max trigger calculation

gopherbot commented 2 years ago

Change https://go.dev/cl/399474 mentions this issue: runtime: redesign scavenging algorithm

gopherbot commented 2 years ago

Change https://go.dev/cl/403614 mentions this issue: runtime/metrics: add /gc/cpu/limiter-overflow:cpu-seconds metric

mknyszek commented 2 years ago

The core feature has landed, but I still need to land a few new metrics to help support visibility into this.

gopherbot commented 2 years ago

Change https://go.dev/cl/406574 mentions this issue: runtime: reduce useless computation when memoryLimit is off

gopherbot commented 2 years ago

Change https://go.dev/cl/406575 mentions this issue: runtime: update description of GODEBUG=scavtrace=1

gopherbot commented 2 years ago

Change https://go.dev/cl/410735 mentions this issue: doc/go1.19: adjust runtime release notes

gopherbot commented 2 years ago

Change https://go.dev/cl/410734 mentions this issue: runtime: document GOMEMLIMIT in environment variables section

rabbbit commented 1 year ago

Hey @mknyszek - first of all, thanks for the excellent work; this is great.

I wanted to share our experience thinking about enabling this in production. It works great and exactly as advertised. Some well-maintained applications have enabled it with great success, and the usage is spreading organically.

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

Another aspect is that our containers typically don't use close to all the available CPU time. So the assumption from the gc-guide, while true, has a slightly different result in practice:

The intuition behind the 50% GC CPU limit is based on the worst-case impact on a program with ample available memory. In the case of a misconfiguration of the memory limit, where it is set too low mistakenly, the program will slow down at most by 2x, because the GC can't take more than 50% of its CPU time away.

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

We're also concerned with a "degradation on failover" situation - an application that is usually fine might, given a sudden increase in traffic, end up in a death spiral. And this would be precisely the time we need to avoid one.

What we're doing now is:

Hope this is useful. Again, thanks for the excellent work.

mknyszek commented 1 year ago

Thanks for the detailed feedback and I'm glad it's working well for you overall!

Speaking broadly, I'd love to know more about what exactly this degraded state looks like. What is the downstream effect? Latency increase? Throughput decrease? Both? If you could obtain a GODEBUG=gctrace=1 (outputs to STDERR) of this degraded state, that would be helpful in identifying what if any next steps we should take.

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

Choosing to die quickly over struggling for a long time is an intentional point in the design. In these difficult situations something has to give and we chose to make that memory.

But also if the scenario here is memory leaks, it's hard to do much about that without fixing the leak. The live heap will grow and eventually even without GOMEMLIMIT you'll OOM as well. GOMEMLIMIT isn't really designed to deal with a memory leak well (generally, we consider memory leaks to be a bug in long-running applications), and yeah I can see turning it on basically turning into "well, it just gets slower before it dies, and it takes longer to die," which may be worse than not setting a memory limit at all.

As for fixing memory leaks, we're currently planning some work on improving the heap analysis situation. I hope that'll make keeping applications leak-free more feasible in the future. (#57447)

(I recognize that encountering a memory leak bug at some point is inevitable, but in general we don't expect long-running applications to run under the expectation of memory leaks. I also get that it's a huge pain these days to debug them, but we're looking into trying to make that better with heap analysis.)

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

FTR that's what the runtime/debug.SetMemoryLimit API is for and it should be safe (performance-wise) to call with a relatively high frequency. Just to be clear, is this also the memory leak scenario?

The 90% case you're describing sounds like a misconfiguration to me; if the application's live heap is really close enough to the memory limit to achieve this kind of death spiral scenario, then the intended behavior is to die after a relatively short period, but it might not if it turns out there's actually plenty of available memory. However, this cotenant situation might not be ideal for the memory limit to begin with.

As a general rule, the memory limit, when used in conjunction with GOGC=off, is not a great fit for an environment where the Go program is potentially cotenant with others, and the others don't have predictable memory usage (or the Go application can't easily respond to cotenant changes). See https://go.dev/doc/gc-guide#Suggested_uses. In this case I'd suggest slightly overcommitting the memory limit to protect against many transient spikes in memory use (in your example here, maybe 95-96%), but set GOGC to something other than off.
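(To make that suggestion concrete, here is a minimal sketch of such a configuration: GOGC stays on and the memory limit is derived from the container's limit with a little headroom. The cgroup v2 path and the 95% factor are assumptions about the deployment environment, not part of the runtime API.)

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// setLimitFromCgroup sets a slightly overcommitted soft memory limit based
// on the container's cgroup v2 memory limit, while leaving GOGC at its
// default so ordinary pacing still applies.
func setLimitFromCgroup() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max") // hypothetical cgroup v2 path
	if err != nil {
		return // not in a container with a limit; keep the defaults
	}
	limit, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return // "max" or unparsable; keep the defaults
	}
	debug.SetGCPercent(100)                // GOGC stays on
	debug.SetMemoryLimit(limit * 95 / 100) // overcommit slightly, e.g. 95%
}

func main() {
	setLimitFromCgroup()
	// ... application code ...
}
```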

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

I'm not sure I follow. Are you describing a situation in which your application is using say, 25% CPU utilization, and the GC is eating up 50%?

We're also concerned with a "degradation on failover" situation - an application that is usually fine might, given a sudden increase in traffic, end up in a death spiral. And this would be precisely the time we need to avoid one.

(Small pedantic note, but the 50% GC CPU limiter is a mechanism to cut off the death spiral; in general a death spiral means that the GC keeps taking on more and more of the CPU load until application progress stops entirely.)

I think it depends on the load you're expecting. It's always possible to construct a load that'll cause some form of degradation, even when you're not using the memory limit (something like a tight OOM loop as the service gets restarted would be what I would expect with just GOGC).

If the memory limit is failing to degrade gracefully, then that's certainly a problem and a bug on our side (perhaps even a design flaw somewhere!). (Perhaps this risk of setting a limit too low such that you sit in the degraded state for too long instead of actually falling over can be considered something like failing to degrade gracefully, and that suggests that even 50% GC CPU is trying too hard as a default. I can believe that but I'd like to acquire more data first.)

However, without more details about the scenario in question, I'm not sure what else we can do to alleviate the concern. One idea is a backpressure mechanism (#29696), but for now I think we've decided to see what others can build, since the wisdom in this space seems to have shifted a few times over the last few years (e.g. what metric should we use? Memory? CPU? Scheduling latency? A combination? If so, what combination and weighted how? Perhaps it's very application-dependent?).

What we're doing now is:

As a final note, I just want to point out that at the end of the day, the memory limit is just another tool in the toolkit. If you can make some of your applications work better without it, I don't think that necessarily means it's a failure of the memory limit (sometimes it might be, but not always). I'm not saying that you necessarily think the memory limit should be used everywhere, just wanted to leave that here for anyone who comes looking at this thread. :)

cdvr1993 commented 1 year ago

Hi @mknyszek

Regarding the 50% CPU limit... unless we understand it incorrectly, it means the GC can use up to that much CPU to avoid going over the soft limit, but for many of our applications anything more than 20% GC CPU can have a serious impact (mostly when in a failover state). Currently, we dynamically change GOGC: when there is memory available we tend to increase it, and when there isn't we keep decreasing it to enforce our own soft limit, but we have a minimum threshold and we allow different service owners to set their own minimum threshold. That's more or less what we are missing with Go's soft limit.
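(A minimal sketch of this kind of user-space GOGC tuner is below; the target, interval, and thresholds are illustrative assumptions, not any particular implementation.)

```go
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

// tuneGOGC periodically raises GOGC when there is heap headroom and lowers
// it when heap use exceeds the target, never going below minGOGC.
func tuneGOGC(targetBytes uint64, minGOGC, maxGOGC int) {
	gogc := 100
	for range time.Tick(10 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		switch {
		case ms.HeapAlloc < targetBytes*8/10 && gogc+10 <= maxGOGC:
			gogc += 10 // plenty of headroom: allow more growth between cycles
		case ms.HeapAlloc > targetBytes && gogc-10 >= minGOGC:
			gogc -= 10 // over target: collect more aggressively, respecting the floor
		}
		debug.SetGCPercent(gogc)
	}
}

func main() {
	go tuneGOGC(4<<30, 25, 400) // e.g. target 4 GiB, GOGC floor 25, cap 400
	// ... application code ...
	select {} // block forever in this sketch
}
```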

We currently don't have an example using the soft limit, but in the past we have had issues with GOGC being too low, and this caused bigger problems than a few instances crashing due to OOM. So, based on that experience, we think the scenario would repeat with the soft limit.

What would be nice is a way of modifying how much CPU the GC can take to enforce the soft limit, or a minimum GOGC value, so that service owners can decide at what point they believe it is better to OOM than to pay the degradation caused by elevated GC.

Or would you suggest it is better to wait for #56857, so we have a way to keep an eye on the number of live bytes, and when it gets close to the soft limit decide to either eat the cost of GC or just OOM?

rabbbit commented 1 year ago

Thanks for the detailed feedback and I'm glad it's working well for you overall!

Speaking broadly, I'd love to know more about what exactly this degraded state looks like. What is the downstream effect? Latency increase? Throughput decrease? Both? If you could obtain a GODEBUG=gctrace=1 (outputs to STDERR) of this degraded state, that would be helpful in identifying what if any next steps we should take.

Getting the traces to work in production would be hard. We have an HTTP handler to tune GOMEMLIMIT per container, so we can experiment with that with reasonable safety. There's no way to enable traces at runtime, right?

That being said I can perhaps try to reproduce the same situation in staging. What we have seen in production was a significant CPU time utilization increase, leading to CPU throttling, leading to both latency increase and throughput decrease.

Below is a screenshot of a "slowly leaking application" (explained more below) where we enabled GOMEMLIMIT temporarily. Note the CPU utilization increased significantly more than we expected - more than 50% of GOMAXPROCS.

[screenshot: CPU utilization of the container after temporarily enabling GOMEMLIMIT]

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

Choosing to die quickly over struggling for a long time is an intentional point in the design. In these difficult situations something has to give and we chose to make that memory.

But also if the scenario here is memory leaks, it's hard to do much about that without fixing the leak. The live heap will grow and eventually even without GOMEMLIMIT you'll OOM as well. GOMEMLIMIT isn't really designed to deal with a memory leak well (generally, we consider memory leaks to be a bug in long-running applications), and yeah I can see turning it on basically turning into "well, it just gets slower before it dies, and it takes longer to die," which may be worse than not setting a memory limit at all.

As for fixing memory leaks, we're currently planning some work on improving the heap analysis situation. I hope that'll make keeping applications leak-free more feasible in the future. (#57447) (I recognize that encountering a memory leak bug at some point is inevitable, but in general we don't expect long-running applications to run under the expectation of memory leaks. I also get that it's a huge pain these days to debug them, but we're looking into trying to make that better with heap analysis.)

So I think you might be too optimistic vs what we see in our reality here (sorry:)). We:

  1. have applications that leak quickly; they restart often and need to be fixed. Those typically have higher priority, and can be diagnosed with some effort - I wouldn't actually call it pain though, profiles are typically helpful enough.
  2. have "slowly leaking memory" applications that just very slowly accumulate memory as they run. These are actually low priority - as long as {release_frequency} > 2-5*{time_to_oom}, fixing it will not get prioritized. Especially if some of the leaks are in gnarly bits like stat emission. This only becomes a problem during extended quiet periods - the expectation is still that the applications will crash rather than degrade.

In summary though, we strongly expect leaks to be around forever.

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

FTR that's what the runtime/debug.SetMemoryLimit API is for and it should be safe (performance-wise) to call with a relatively high frequency. Just to be clear, is this also the memory leak scenario?

Yeah, so we would need to continue running a custom tuner though, right? It also seems that if we're tuning in "user space", equivalent results can be achieved with GOGC and GOMEMLIMIT - right?

The 90% case you're describing sounds like a misconfiguration to me; if the application's live heap is really close enough to the memory limit to achieve this kind of death spiral scenario, then the intended behavior is to die after a relatively short period, but it might not if it turns out there's actually plenty of available memory. However, this cotenant situation might not be ideal for the memory limit to begin with.

As a general rule, the memory limit, when used in conjunction with GOGC=off, is not a great fit for an environment where the Go program is potentially cotenant with others, and the others don't have predictable memory usage (or the Go application can't easily respond to cotenant changes). See https://go.dev/doc/gc-guide#Suggested_uses. In this case I'd suggest slightly overcommitting the memory limit to protect against many transient spikes in memory use (in your example here, maybe 95-96%), but set GOGC to something other than off.

This is slightly more nuanced (and perhaps off-topic): each of our containers runs with a "helper" process responsible for starting up, shipping logs, and performing local health checks (it's silly, don't ask). The memory we need to reserve for it varies per application - thus, for small containers, 95% might not be enough. For larger applications we can increase the limit, but in both cases we'd likely still need to look at the log output dynamically.

It is not immediately clear to me how to tune the right value of GOGC combined with GOMEMLIMIT. But, more importantly, my understanding of GOMEMLIMIT is that no matter the GOGC value we can still hit the death-spiral situation.

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

I'm not sure I follow. Are you describing a situation in which your application is using say, 25% CPU utilization, and the GC is eating up 50%?

Yeah, @cdvr1993 explained it in the previous comment too. Say a container has GOMAXPROCS=8 but was only using 3 cores at the time. Then we hit GOMEMLIMIT, and the GC is allowed (per our understanding) to use up to 4 cores, so the GC is now using more CPU than the application. At the same time, anything above 80% CPU utilization (in our experience) results in dramatically increased latency.

We're also concerned with a "degradation on failover" situation - an application that is usually fine might, given a sudden increase in traffic, end up in a death spiral. And this would be precisely the time we need to avoid one.

(Small pedantic note, but the 50% GC CPU limiter is a mechanism to cut off the death spiral; in general a death spiral means that the GC keeps taking on more and more of the CPU load until application progress stops entirely.)

Perhaps we need a different name here then:) What we've observed might not be a death spiral, but a degradation large enough to severely disrupt production. Even with the 50% limit.

I think it depends on the load you're expecting. It's always possible to construct a load that'll cause some form of degradation, even when you're not using the memory limit (something like a tight OOM loop as the service gets restarted would be what I would expect with just GOGC).

Yeah, the problem seems to occur for applications that are "mostly fine", with days between OOMs.

If the memory limit is failing to degrade gracefully, then that's certainly a problem and a bug on our side (perhaps even a design flaw somewhere!). (Perhaps this risk of setting a limit too low such that you sit in the degraded state for too long instead of actually falling over can be considered something like failing to degrade gracefully, and that suggests that even 50% GC CPU is trying too hard as a default. I can believe that but I'd like to acquire more data first.)

However, without more details about the scenario in question, I'm not sure what else we can do to alleviate the concern. One idea is a backpressure mechanism (#29696), but for now I think we've decided to see what others can build, since the wisdom in this space seems to have shifted a few times over the last few years (e.g. what metric should we use? Memory? CPU? Scheduling latency? A combination? If so, what combination and weighted how? Perhaps it's very application-dependent?).

IMO it seems like what you built is "almost perfect". We just need the applications to "die faster" - the easiest changes that come to mind would be reducing the limit from 50%, to either something like 25% or a static value (2 cores?).

When I say "almost perfect" I mean it though - I suspect we could rollout the GOMEMLIMIT to 98% of our applications with great results and without a problem, but the remaining users would come after us with pitchforks. And that forces us to use the GOMEMLIMIT as an opt-in, which is very disappointing given the results we see in 98% of the applications.

Thanks for the thoughtful response!

rabbbit commented 1 year ago

Hey @mknyszek @cdvr1993 I raised a new issue in https://github.com/golang/go/issues/58106.

VEDANTDOKANIA commented 1 year ago

@rabbbit @mknyszek @rsc we are facing an issue regarding the memory limit. We are setting the memory limit to 18GB on a 24GB server, but the GC still runs very frequently and eats up 80 percent of CPU, while memory used is only 4 to 5 GB max. Also, is the memory limit per goroutine, or how do we set it for the whole program?

In the entry point of our application we have specified something like this:

debug.SetMemoryLimit(int64(8 * 1024 * 1024 * 1024))

Is this okay or do we need to do something additional? Also, where do we set the optional unit described in the documentation?

mknyszek commented 1 year ago

@VEDANTDOKANIA Unfortunately I can't help with just the information you gave me.

Firstly, how are you determining that the GC runs very frequently, and that it uses 80 percent of CPU? That's far outside the bounds of what the GC should allow: there's an internal limiter that caps the GC at 50% of available CPU (as defined by GOMAXPROCS) and will prioritize using new memory over additional CPU usage beyond that point.
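(For reference, one way to check from inside the process whether the GC CPU limiter has engaged is the /gc/limiter/last-enabled:gc-cycle metric; a minimal sketch, assuming a Go release that includes the limiter:)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the GC cycle during which the CPU limiter was last enabled
	// (zero means it has never engaged).
	s := []metrics.Sample{{Name: "/gc/limiter/last-enabled:gc-cycle"}}
	metrics.Read(s)
	if cycle := s[0].Value.Uint64(); cycle != 0 {
		fmt.Printf("GC CPU limiter last enabled during GC cycle %d\n", cycle)
	} else {
		fmt.Println("GC CPU limiter has never been enabled")
	}
}
```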

Please file a new issue with more details, ideally:

Thanks.

Also, is the memory limit per goroutine, or how do we set it for the whole program?

It's for the whole Go process.

In the entry point of our application we have specified something like this:

debug.SetMemoryLimit(int64(8 * 1024 * 1024 * 1024))

That should work fine, but just so we're on the same page, that will set an 8 GiB memory limit. Note that the GC may execute very frequently (but again, still capped at roughly 50%) if this value is set smaller than the baseline memory use your program requires.

Also where to set the optional unit as described in the documentation

The optional unit is part of the GOMEMLIMIT environment variable that Go programs understand. e.g. GOMEMLIMIT=18GiB.