golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: make `GOMAXPROCS` cfs-aware on `GOOS=linux` #33803

Open jcorbin opened 5 years ago

jcorbin commented 5 years ago

Problem

The default setting of runtime.GOMAXPROCS() (to be the number of os-apparent processors) can be greatly misaligned with container cpu quota (e.g. as implemented through cfs bandwidth control by docker).

This can lead to large latency artifacts in programs, especially under peak load, or when saturating all processors during background GC phases.

The smaller the container and the larger the machine, the worse this effect becomes: say you deploy a fleet of microservice workers, each container with a cpu quota of 4, on a fleet of 32-processor[1] machines.

To understand why, you really have to understand the CFS quota mechanism; this blog post explains it well (with pictures); this kubernetes issue further explores the topic (especially as it relates to a recently resolved kernel cpu accounting bug). But to summarize it briefly for this issue:

Running an application workload at a reasonable level of cpu efficiency makes it quite likely that you'll be spiking up to your full quota and getting throttled.

Background workloads, like concurrent GC[2], are especially likely to cause quota exhaustion.

I hesitate to even call this a "tail latency" problem; the artifacts are visible in the main body of the latency distribution and can shift the entire distribution.
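To make the throttling arithmetic concrete, here is a small worked example. The numbers are hypothetical, chosen only to match the shape of the scenario above (a 4-CPU quota on a 32-processor machine, with the default 100ms CFS period):

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers for illustration: a container with a CFS
	// quota of 400ms of cpu time per 100ms period (i.e. 4 CPUs),
	// running a Go program with GOMAXPROCS=32 that briefly saturates
	// all 32 threads (e.g. during a GC phase).
	const (
		periodMs   = 100.0 // cfs_period_us, expressed in ms
		quotaMs    = 400.0 // cfs_quota_us, expressed in ms (4 CPUs worth)
		gomaxprocs = 32.0
	)

	// With 32 threads running flat out, the 400ms budget is burned in:
	runMs := quotaMs / gomaxprocs // 12.5ms
	// ...after which every thread is throttled for the rest of the period:
	throttledMs := periodMs - runMs // 87.5ms

	fmt.Printf("runnable for %.1fms, throttled for %.1fms of every %.0fms period\n",
		runMs, throttledMs, periodMs)
}
```

So even though the container averages its full 4-CPU allotment, the process spends most of each period frozen, which is exactly the latency artifact described above.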

Solution

If you care about latency, reliability, predictability (... insert more *ilities to taste), then the correct thing to do is to never exceed your cpu quota, by setting GOMAXPROCS=max(1, floor(cpu_quota)).

Using this as a default for GOMAXPROCS makes the world safe again, which is why we use uber-go/automaxprocs in all of our microservices.
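As a sketch of what max(1, floor(cpu_quota)) looks like in practice, the function below derives the value from the contents of a cgroup v2 cpu.max file (which holds "<quota_us> <period_us>", or "max <period_us>" when no quota is set). This is an illustrative sketch, not the automaxprocs implementation; real code would read /sys/fs/cgroup/cpu.max (or the v1 equivalents) and pass the result to runtime.GOMAXPROCS:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procsFromCPUMax computes max(1, floor(quota)) from the contents of a
// cgroup v2 cpu.max file. It returns 0 (meaning "leave GOMAXPROCS
// alone") when there is no quota or the contents are malformed.
func procsFromCPUMax(content string) int {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return 0 // no quota set
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period <= 0 {
		return 0
	}
	procs := int(quota / period) // floor(cpu_quota)
	if procs < 1 {
		procs = 1 // never go below one proc
	}
	return procs
}

func main() {
	fmt.Println(procsFromCPUMax("400000 100000")) // 4 CPUs -> 4
	fmt.Println(procsFromCPUMax("250000 100000")) // 2.5 -> floor -> 2
	fmt.Println(procsFromCPUMax("50000 100000"))  // 0.5 -> clamped to 1
	fmt.Println(procsFromCPUMax("max 100000"))    // no quota -> 0
}
```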

NOTEs

  1. intentionally avoiding use of the word "core"; the matter of hyper-threading and virtual-vs-physical cores is another topic
  2. /digression: can't not mention userspace scheduler pressure induced by background GC; where are we at with goroutine preemption again?
jcorbin commented 5 years ago

I really have to disagree with some of the latter suggestions in https://github.com/kubernetes/kubernetes/issues/67577 like https://github.com/kubernetes/kubernetes/issues/67577#issuecomment-520603103: using ceil(quota)+2, or in any way over-provisioning GOMAXPROCS vs cpu quota, is at best a statistical gamble to ameliorate current shortcomings in Go's userspace scheduler.

Some background on https://github.com/uber-go/automaxprocs/pull/13 (changing from ceil(quota) to floor(quota)):

I'll reprise (copied with some edits) my description from that issue here for easy reading:

  • as far as the Go scheduler is concerned, there's no such thing as a fractional CPU
  • so let's say you have your quota set to N + p for some integer N and some 0.0 < p < 1.0
  • the only safe assumption then is that you're using that p value as a hedge for something like "systems effects" or "c libraries"
    • in that latter case, what you really might want is to be able to give maxprocs some value K of CPUs stolen for some parasite like a C library or sidecar; but this will always need to be application-config specific
jcorbin commented 5 years ago

Noting: https://github.com/golang/go/issues/19378#issuecomment-304892994 explores some GC-CFS relationship

jcorbin commented 5 years ago

For comparative purposes, an Oracle blog post about Java adding similar support (especially for GC threads)

ianlancetaylor commented 5 years ago

In the original post the links to "cfs bandwidth control" and "this blog post" and "kubernetes issue" do not resolve, so I'm having a hard time understanding this issue.

jcorbin commented 5 years ago

In the original post the links to "cfs bandwidth control" and "this blog post" and "kubernetes issue" do not resolve, so I'm having a hard time understanding this issue.

Oops; fixed

ianlancetaylor commented 5 years ago

CC @aclements @mknyszek

(I'm not sure this has to be a proposal at all. This is more like a bug report. See https://golang.org/s/proposal.)

ianlancetaylor commented 5 years ago

Changing this from a proposal into a feature request for the runtime package.

mknyszek commented 5 years ago

@jcorbin I'm certainly not opposed. After dissecting uber-go/automaxprocs it seems like it requires a bunch of string parsing to really get to the numbers.

This is possible to do from the runtime but also a bit complex. Note that you can't allocate and you need to use raw system calls to process files.

I previously did something similar to get the default huge page size but later found you could just read an integer hiding down in /sys at a path that doesn't change between Linux distributions and environments.

I assume it's not quite so simple with cgroups (even though for bash on my machine that just ends up in /sys) and that you actually need to do this parsing because the paths to the files containing the quota and period could be different depending on your environment.

If it's possible to reach these values in a simpler way (i.e. just a static path at which there's a file that just contains an integer) that would be preferred.
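For cgroup v2, at least, the parsing is fairly contained. The sketch below shows the kind of string handling involved: mapping a process's /proc/self/cgroup line ("0::/<path>") to its cpu.max file. The /sys/fs/cgroup mount point is an assumption (it is conventional but not guaranteed), and real code must also handle cgroup v1, nested limits, and non-standard mounts:

```go
package main

import (
	"fmt"
	"strings"
)

// cpuMaxPath maps the contents of /proc/self/cgroup to the path of the
// cgroup v2 cpu.max file, assuming the conventional /sys/fs/cgroup
// mount point. Hypothetical helper for illustration only.
func cpuMaxPath(procSelfCgroup string) (string, bool) {
	for _, line := range strings.Split(procSelfCgroup, "\n") {
		// cgroup v2 has a single entry whose controller list is empty,
		// so the line looks like "0::/some/cgroup/path".
		if strings.HasPrefix(line, "0::") {
			rest := strings.TrimPrefix(line, "0::")
			return "/sys/fs/cgroup" + rest + "/cpu.max", true
		}
	}
	return "", false // cgroup v1 only, or unrecognized format
}

func main() {
	path, ok := cpuMaxPath("0::/kubepods/burstable/pod1234")
	fmt.Println(ok, path)
}
```

Doing this inside the runtime is still harder than it looks here, since (as noted above) runtime startup code can't allocate freely and must read files via raw system calls.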

akshayjshah commented 2 years ago

https://danluu.com/cgroup-throttling/ has been making the rounds lately, and I've seen many more Gophers aware of this issue. It'd be wonderful if the runtime could do the right thing for users by default, even if it takes some slightly thorny string processing.

Is the Go team open to CLs adding this, or is the feature out of scope?

ianlancetaylor commented 2 years ago

We're certainly open to it. I, at least, have no idea what is required to implement it. It would be helpful if someone could move this issue past "we should do this" to "this is how it can be done." Thanks.

guycall commented 1 year ago

FYI, Ruby appears to have an implementation that works across OSes - https://rubyapi.org/3.2/o/etc#method-c-nprocessors

pete-woods commented 1 year ago

It's surely telling how many other issues/PRs link to this issue (typically via Uber's automaxprocs module); this is clearly something the community wants and needs.

ianlancetaylor commented 1 year ago

@guycall Go does that too. That is how runtime.GOMAXPROCS(0) works today.

@pete-woods Do you have a specific proposal for exactly how the Go runtime should change? Thanks.

dprotaso commented 1 year ago

I think the ask is similar to what https://github.com/uber-go/automaxprocs does.

Which is to change the go runtime to adjust GOMAXPROCS depending on the allotted CPU quota. For Linux this implies reading from the /proc filesystem to get cgroup[v2] information.
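For cgroup v1 the shape of the data is slightly different: cpu.cfs_quota_us and cpu.cfs_period_us are separate files, each holding one integer, and a quota of -1 means no limit. A hedged sketch of that case (illustrative only, not the automaxprocs logic verbatim):

```go
package main

import "fmt"

// procsFromV1 derives a GOMAXPROCS value from cgroup v1's
// cpu.cfs_quota_us and cpu.cfs_period_us values. A quota of -1 means
// "no limit", in which case limited is false and the runtime should
// fall back to the processor count.
func procsFromV1(quotaUs, periodUs int64) (procs int, limited bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false // -1 (or bogus values): no cpu limit applies
	}
	procs = int(quotaUs / periodUs) // floor(cpu_quota)
	if procs < 1 {
		procs = 1 // never go below one proc
	}
	return procs, true
}

func main() {
	fmt.Println(procsFromV1(400000, 100000)) // 4-CPU quota
	fmt.Println(procsFromV1(-1, 100000))     // unlimited
}
```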

pete-woods commented 1 year ago

As far as I'm aware that is absolutely not what Go does. It's not cgroups / CFS aware, or there wouldn't be copious references to this deficiency and libraries like the aforementioned Uber one - we'd all just set GOMAXPROCS to zero and be done.

This issue has been open for more than 4 years, and Go is extremely commonly deployed in containerised environments. The JVM fixed this issue years ago, and it is surely not too much to hope that Google, with its vast talent and money, could address this common use case.

We could read the source of the JVM for inspiration, look at the implementation of Uber's library, or consider making it opt-in only when a specific env var is set. To me what's lacking here is attention from Google / Go folks, and clearly having a lot of thumbs-ups on the issue isn't enough.

dprotaso commented 1 year ago

I was describing the ideal/expected behaviour - I'll update my comment.

ianlancetaylor commented 1 year ago

@pete-woods I'm not sure who you are replying to, but, just to be clear, when I said above "That is how runtime.GOMAXPROCS(0) works today." I was replying to @guycall 's note about Ruby.

pete-woods commented 1 year ago

Well it was you, but clearly I was confused by the @! At any rate I've nagged on this very quiet issue now, and I don't think further nagging (at least this year) will be helpful.

ianlancetaylor commented 1 year ago

CC @golang/runtime

guycall commented 1 year ago

@ianlancetaylor thanks. So if I understand correctly, it will default to the value of runtime.NumCPU()

On OpenBSD that comes from HW_NCPUONLINE. It is not clear to me if that is the number of cores on the computer or the limit enforced via setrlimit on the process.

The Linux calculation is this. Again I am not sure if this is container aware.

thepudds commented 1 year ago

FYI, I have a WIP CL that I think should hopefully work for cgroups v2. I’ve tested the main pieces separately, but not yet end-to-end. I hope to send a WIP CL maybe tonight or tomorrow, assuming no snags.

To start, it could just serve as a proof of concept for further discussion, but if there’s interest and it seems on target, I’d be happy to try to get it over the finish line, including extending it to handle cgroups v1. Alternatively, I’d of course also be fine if the runtime team preferred to implement something themselves.

thepudds commented 1 year ago

Hi @guycall

The Linux calculation is this. Again I am not sure if this is container aware.

The Go Linux calculation is not currently container aware.

Part of what Ian was saying is that this Ruby code is not container aware either:

https://rubyapi.org/3.2/o/etc#method-c-nprocessors

guycall commented 1 year ago

@thepudds Sorry I misunderstood Ian's point, thank you for the clarification.

However I did run Ruby within a LXC container and it correctly reported the constrained core count of the container. So it does appear Ruby is somewhat container aware.

BenTheElder commented 1 year ago

LXC mounts a fake /proc using https://github.com/lxc/lxcfs, other container environments like Docker do not.

mknyszek commented 1 year ago

One thing I worry about with implicit container awareness is what the impact will be when users update their Go toolchain. Maybe for CPU usage this is relatively safe? It's also important to consider co-tenant situations, i.e. containers with multiple processes sharing resources (though perhaps in that case everyone who cares is already setting GOMAXPROCS, but I don't know how to find out if that's true).

Maybe this is a non-issue, but I'd like to make sure it's considered before we land anything here.

thepudds commented 1 year ago

Hi @mknyszek, I suspect one piece of the puzzle regarding users updating their toolchain might be a GODEBUG that follows #56986 in order to smooth out the transition?

akshayjshah commented 1 year ago

One thing I worry about with implicit container awareness is what the impact will be when users update their Go toolchain. Maybe for CPU usage this is relatively safe? It's also important to consider co-tenant situations, i.e. containers with multiple processes sharing resources (though perhaps in that case everyone who cares is already setting GOMAXPROCS, but I don't know how to find out if that's true).

When go.uber.org/automaxprocs rolled out at Uber, the effect on containerized Go services was universally positive. At least at the time, CFS imposed such heavy penalties on Go binaries exceeding their CPU allotment that properly tuning GOMAXPROCS was a significant latency and throughput improvement. It's been a few years, but I don't recall any reports of performance regressions. (At the time, Uber had ~1.5k engineers actively writing Go and roughly the same number of containerized Go services.) @sywhang is Uber's current Go platform lead and may have more up-to-date experience reports.

@jcorbin touches on multi-tenancy in his early comments on this issue. Multi-tenant containers were common for Uber at the time - a large fraction of the cluster ran Mesos, so the Thermos executor ran in each container along with the Go process. The logic in automaxprocs attempted to take this into account by rounding down CPU allotments to the nearest whole number (with a floor of 1), with the idea that the fractional portion of the allotment was likely for some ancillary process or C library. This approach certainly isn't perfect, but it seems strictly better than (1) having the Go process(es) attempt to use every processor on the host or (2) rolling the dice and rounding up fractional allotments.

Based on that experience, I'm optimistic that implicit container awareness is the right default for the Go community (with a careful GODEBUG-gated rollout).

bradfitz commented 10 months ago

Another blog post about this: https://www.riverphillips.dev/blog/go-cfs/

thepudds commented 10 months ago

I have a working CL for cgroups v2 that seemed to work end-to-end for some basic manual testing.

One thing that caused me to pause was that I then started digging into what might be possible for testing on the builders, including whether or not a test could (or should) define some test cgroups, and in the meantime I missed the window for Go 1.21.

Does anyone have any quick thoughts on approach for testing on the builders?

RiverPhillips commented 9 months ago

@thepudds maybe open the PR and we can see. I had a brief look at the .NET runtime, and they didn't have any automated tests when they introduced this functionality; they just validated it locally.