aclements opened this issue 4 years ago
Out of curiosity I looked up when we started doing a GC every two minutes. It was introduced in https://golang.org/cl/5451057, a CL whose purpose was to release unused memory to the operating system. The specific suggestion to ensure that a GC is run at least once every two minutes was https://codereview.appspot.com/5451057/#msg14.
Of course the garbage collector is very different now than it was in 2012. But it does seem valid to consider whether an idle program will eventually release unused memory to the operating system.
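For context, a program that wants unused memory returned promptly can already force it by hand; here is a minimal sketch using the standard runtime/debug API (the workload function is made up for illustration):

```go
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	work() // allocate a large temporary heap
	// Force a GC and ask the runtime to return as much freed memory
	// to the OS as possible, instead of waiting for the periodic GC.
	debug.FreeOSMemory()
	time.Sleep(time.Hour) // stay idle afterwards
}

func work() {
	bufs := make([][]byte, 1024)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<20)
	}
}
```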
Or we could monitor how much was released to the OS by a GC at N minutes that recovered R bytes, and if it was not "enough", delay the next GC until either 2N minutes pass or 2R bytes are allocated (or the GOGC threshold is hit). Reset this policy after periods of rapid allocation and, perhaps, start at a smaller number of minutes than 2.
This might have a poor interaction with finalizers in programs that relied on them being even vaguely timely.
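A rough sketch of that back-off policy, just to make the arithmetic concrete; the gcResult type and the observation hook are hypothetical, nothing like them exists in the runtime today:

```go
package main

import "time"

// gcResult is a hypothetical summary of one forced GC cycle.
type gcResult struct {
	releasedToOS uint64 // bytes actually returned to the OS
	recovered    uint64 // bytes of heap reclaimed (R)
}

type periodicGCPolicy struct {
	interval   time.Duration // N: starts smaller than 2 minutes
	allocGoal  uint64        // allocate this much and we GC regardless
	minRelease uint64        // what counts as "enough" released memory
}

// observe adjusts the policy after each forced GC.
func (p *periodicGCPolicy) observe(res gcResult) {
	if res.releasedToOS < p.minRelease {
		// The forced GC didn't buy us much: wait 2N minutes next time,
		// or until 2R bytes have been allocated (or GOGC fires first).
		p.interval *= 2
		p.allocGoal = 2 * res.recovered
	} else {
		// Reset after a productive cycle / a period of rapid allocation.
		p.interval = 30 * time.Second
	}
}

func main() {
	p := &periodicGCPolicy{interval: 30 * time.Second, minRelease: 1 << 20}
	p.observe(gcResult{releasedToOS: 0, recovered: 64 << 20})
}
```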
Regarding the autoscaler, I am not entirely sure that this will seal the deal for them. Once upon a time I wrote an all-singing-all-dancing implementation of CRC32 for Java, that would use fork-join parallelism from the system pool to make CRC run faster (CRCs are somewhat embarrassingly parallel, the combine step requires a little work but not tons). I went so far as to demonstrate superlinear speedup. This did not get checked in, because the Java library team was apparently too nervous about people actually using fork-join parallelism in libraries, even though it was there for all the Collections stuff. My attitude was (and remains) "there is no such thing as unexpected parallelism in this millennium, and if a user doesn't want it to happen, they can size the system fork join pool to what they want, that's what it's there for". And of course, pay no attention to that multithreaded same-address-space optimizing compiler behind the curtain. But no dice, then.
My opinion is that if you don't want a process using more than N cores, that is what GOMAXPROCS is for: set it to the number you want. An autotuning algorithm that gives you more cores after you use all the cores it told you you could use is broken, especially now. We should not be surprised if people start using all the cores they can to handle embarrassingly parallel problems.
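For reference, both standard ways to cap a Go process's parallelism (the value 4 is just an example):

```go
// Either set the environment variable before starting the process:
//
//     GOMAXPROCS=4 ./myserver
//
// or set it from code at startup:
package main

import "runtime"

func main() {
	runtime.GOMAXPROCS(4) // execute user Go code on at most 4 CPUs at once
	// ... rest of the program ...
}
```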
There's a difference between bulk computational tasks (where you can expect them to use what you give them) and demand-driven tasks (which will not use it, and if they're using it they might need more). Autoscalers are designed to respond to the latter type. But idle GC that uses easily 10x the CPU of the task's recent 1 minute usage completely throws off that assumption -- there's no way to tell the difference between "task is hammered with demand" and "task is doing a random GC".
But it does seem valid to consider whether an idle program will eventually release unused memory to the operating system.
:100:
I agree it would be a mistake to not let idle programs eventually return unused memory. For one, it would be pretty confusing for users and potentially even lead to more "maybe-a-memory-leak" issues being filed ("after restarting my application, go heap decreased significantly").
there's no way to tell the difference between "task is hammered with demand" and "task is doing a random GC"
Random suggestion: expose an expvar that tracks inflight real work (e.g. number of "real work" requests being currently served) and inhibit autoscaling if that metric is 0.
Or tweak the autoscaler to more quickly lower the reservation once the process goes idle to avoid the "unbounded" growth.
Alternatively, since idle GC isn't triggered by any real memory pressure, run it single-threaded?
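A minimal sketch of the expvar suggestion above; the variable name and handler are made up, but the expvar and net/http APIs are standard:

```go
package main

import (
	"expvar"
	"net/http"
)

// inflight_requests counts "real work" requests currently being served,
// so an external autoscaler could ignore CPU spikes while this gauge is 0.
var inflight = expvar.NewInt("inflight_requests")

func withInflight(h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inflight.Add(1)
		defer inflight.Add(-1)
		h.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/work", withInflight(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("done"))
	})))
	// Importing expvar registers /debug/vars on the default mux.
	http.ListenAndServe(":8080", nil)
}
```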
Oops.
I want to clarify that I didn't mean the two minute GC. That also has some downsides, but I think the benefits outweigh the costs.
I was talking about how the garbage collector schedules GC work during idle time on every GC cycle. This is what interferes with autoscaling.
I see. IIUC, then https://github.com/golang/go/issues/14812#issuecomment-518902192 (GC causing ~100ms latency spikes when running in containers with CPU limits) can also be considered an additional argument in favor of fixing this.
It's not just the write barrier and allocating black; the GC needs to race to a state where it isn't trashing the caches and causing mutators to take capacity misses.
The problem of having multiple schedulers, each thinking it is omnipotent, is as fascinating as it is hard as a research problem. If the autoscaler has access to various HW monitors, perhaps they could be used to discriminate whether the demand increase is caused by the GC or by the mutators. It feels like a problem for machine learning, using the GC logs to automate training.
Finally, I suspect Java and other GC'd languages have similar problems. Does the autoscaler have this problem with Java, and if not, what can be learned?
Java's GC is heavily tunable. One could, for example, restrict the number of threads it runs on.
I see. IIUC, then #14812 (comment) (GC causing ~100ms latency spikes when running in containers with CPU limits) can also be considered an additional argument in favor of fixing this.
@CAFxX , yep! (Thanks for that cross-reference, I've added it to the top post.)
In the short run, how would autoscalers respond if GC limited its "looks like nobody's using those cores" resource consumption to min(idle/2, idle-1)? So for example, a single-threaded app, on a 2-core box, would leave the 2nd core alone and do 25% time-slicing against the mutator in one core.
3 and 4 cores -> GC takes 100% of one idle core (leaving 1 or 2 still idle)
5, 6 -> 2 (leaving 2 or 3 still idle)
(alternate formula: 1 + idle/3, which would use the second core for the 1-busy-1-idle case).
This doesn't mean the cores are actually idle, if there are other processes or if the OS decides it has important work to do, perhaps induced by GC activity.
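For concreteness, here is that cap as a plain function (this is not how the runtime structures the decision, just the arithmetic from above):

```go
package main

import "fmt"

// idleGCWorkers sketches the proposed cap: min(idle/2, idle-1) whole idle
// workers, never negative. Not actual runtime code.
func idleGCWorkers(gomaxprocs, busy int) int {
	idle := gomaxprocs - busy
	if idle <= 1 {
		return 0 // 1-busy-1-idle: only fractional time-slicing, no whole worker
	}
	n := idle / 2
	if idle-1 < n {
		n = idle - 1
	}
	return n
}

func main() {
	for procs := 2; procs <= 6; procs++ {
		fmt.Printf("GOMAXPROCS=%d, 1 busy: %d idle GC workers\n", procs, idleGCWorkers(procs, 1))
	}
}
```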
Another possibility is capping the idle parallelism; have we measured how much marginal returns diminish as the number of cores devoted to GC increases? For the CRC example above, it turned out that 4 cores was very much a sweet spot; speedup fell into the almost-linear to superlinear range.
(I still think autoscalers will need to cope more gracefully with the existence of embarrassingly parallel subproblems. Clock rates are stuck, people will look for these. It would be nice if the OS/autoscaler had some way to communicate the "cost" of an extra core. Otherwise, sure the marginal returns per core are falling, but it's "my" core and it's "idle", so why not use it? )
Limiting idle GC is certainly a possibility. In the past we've talked about letting one P enter scheduler idle even if GC is running so that there's a P to quickly wake up and also to block in netpoll (though if you do wake it up, do you have to shut down idle GC running on another P to keep a spare P?).
It's actually been a very long time since we've measured the scalability of the garbage collector. It certainly has limits. I suspect they're high enough that they're significantly higher than a mostly-idle latency-sensitive application is likely to use on its own, but at least it would cap out the feedback cycle.
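If someone wants a back-of-the-envelope data point, something like the following hypothetical micro-benchmark times a whole forced cycle over a fixed pointer-heavy live heap while varying GOMAXPROCS. It does not isolate the mark phase, so take the numbers with a grain of salt:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type node struct {
	next *node
	pad  [64]byte
}

func main() {
	// Build a pointer-heavy live heap so the mark phase has real work to do.
	var head *node
	for i := 0; i < 1<<20; i++ {
		head = &node{next: head}
	}
	for _, procs := range []int{1, 2, 4, 8} {
		runtime.GOMAXPROCS(procs)
		start := time.Now()
		runtime.GC() // blocks until the whole cycle completes
		fmt.Printf("GOMAXPROCS=%d: forced GC took %v\n", procs, time.Since(start))
	}
	runtime.KeepAlive(head)
}
```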
Circa 2002, when Intel first introduced hyperthreading (SMT), I was asked to find ways to use this extra underutilized hyperthread. I rewrote Cheney scan to use the extra hyperthread to effectively prefetch the targets of objects some distance ahead of the scan pointer. It seemed like a good idea at the time but was a loser due to limited memory bandwidth. Perhaps the Go GC is also memory-bandwidth limited. If so, then limiting the CPUs the GC gets based on available bandwidth is a win.
This may be a naive question but here goes anyway: wouldn't it be ideal to give the idle GC its own pacer, so that it can be throttled up or down to finish roughly when the next GC cycle is expected to start, so as to avoid as much as possible the spiky behavior that we know can cause issues with autoscalers, CPU quotas, and latency in general?
A naive question that I have also asked, so I know some of the answer. Running the garbage collector has additional costs above and beyond the work of garbage collection:

- memory is "allocated black" during garbage collection, meaning already marked, which adds to the size of the live set at the end of GC.
- any writes to pointers must also process a write barrier, which slows down the application in general.
- the GC is thought to be generally disruptive of otherwise useful processing; it tends to trash caches and use a lot of memory bandwidth.
From the point of view of energy efficiency, it is believed (with some historical evidence) that the "best" way to collect garbage is as quickly as possible, subject to constraints of letting the mutator get its job done. When I've raised the question of "does the GC's intermittent 25% cpu tax require a 33% container overprovision to ensure that latency/throughput goals are met?", the answer has been that in any large (multiple containers handling requests) service, this is not that different from variations in service load over time, variable request sizes, etc. Stuff happens, loads get balanced. And, this is probably fair, anyone running a single-node service probably has to overprovision anyway because reasons, and how much could a single node cost, anyway?
The Go 1.5 Concurrent GC Pacing document (https://docs.google.com/document/d/1wmjrocXIWTr1JxU-3EQBI6BK6KgtiFArkG47XK73xIQ/edit) does a good job of explaining the GC pacer.
@mknyszek Does https://go-review.googlesource.com/c/go/+/393394/ fix this issue?
I think it addresses the issue in the circumstance that matters most, i.e. idle applications. Theoretically, an application that is idle enough to, say, GC every minute but not every 2 minutes is probably going to have problems. That's a cliff which would still be nice to fix, so I'm inclined to leave this open.
I'm not sure we can justify removing the idle mark workers at this point. As I've noted on other issues, it seems to cost about +1% latency and -1% throughput in most situations where the application is active.
What version of Go are you using (go version)?
But this has been true for years.
What did you do?
Run a mostly-idle application in a container with CPU limits under a mechanism that monitors CPU use and increases the job's CPU reservation if it appears to be too little. Specifically, this was observed with a latency-sensitive policy that looked at high percentile usage sampled over a short time period.
What did you expect to see?
Since the application is mostly idle, a small CPU reservation should be adequate and the auto-scaler should not need to grow that reservation.
What did you see instead?
Because the garbage collector attempts to use any idle cores up to GOMAXPROCS, even an otherwise mostly idle application will see periodic spikes of CPU activity. These will happen at least every 2 minutes if not more frequently. In this case, the auto-scaler's policy was sensitive enough that these spikes caused it to grow the job's CPU reservation. However, then the garbage collector uses all of the new CPU reservation. This leads to a feedback cycle where the auto-scaler continually grows the reservation.
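A hypothetical minimal reproduction of the shape of workload described above (the actual CPU spikes depend on the Go version, GOMAXPROCS, and the container's limits):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
	// Mostly idle: a trickle of allocation, then sleep. The forced
	// 2-minute GC plus idle mark workers periodically drive CPU usage
	// toward GOMAXPROCS even though the program itself does almost nothing.
	for {
		_ = make([]byte, 1<<20)
		time.Sleep(10 * time.Second)
	}
}
```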
See also #17969, https://github.com/golang/go/issues/14812#issuecomment-518902192.
/cc @mknyszek
Thoughts
We've been thinking for a while now that idle GC may be more trouble than it's worth. The idea was to speed up the mark phase, so the write barrier is on for less time and we allocate black for less time. However, if an application is mostly idle, then it's not very sensitive to the (very small) performance impact of the write barrier and probably doesn't produce much floating garbage from allocate-black; and if an application isn't mostly idle, then it's not benefiting from idle GC anyway.