rhysh opened this issue 1 year ago

> Before and after doing work for a request, a goroutine needs to interact with the channel to send or receive a permission token.

How much work is done before acquiring the token, and how much work is done before returning it?
Generally in this sort of pattern I would expect the tokens to be acquired before creating the worker goroutines, which limits contention by performing all of the (blocking) semaphore acquire operations on the same goroutine — often on the same thread. That would still leave some contention from releasing the semaphore tokens, but may substantially reduce contention from parking and unparking goroutines with the lock held, because each “elephant” request still only has one goroutine on the slow path.
If the token is acquired before starting the corresponding goroutine, I would only expect to see significant contention on the channel lock if the work per goroutine is very short: something like (\<cache transfer> + \<critical section>) ⋅ \<# cores>. For a single NUMA node the cost of a cache transfer is pretty low (call it 10ns?), and the critical section for releasing a semaphore token (as opposed to acquiring one) should be quite short (call it 40ns?). So with 48 cores I would expect cache contention to really tick up at around 2.4μs/goroutine, which is much shorter than the 100μs per goroutine that you report.
So, my hypothesis is that this workload is creating the goroutines before acquiring the tokens instead of after.
In that case, the observed cache contention would be caused by too many of the goroutines taking the slow (`gopark`) path of the semaphore-acquire operation, which would cause the semaphore-releasing goroutines to take the slow path in `lock2` (missing the spin section and falling through to `futexsleep`), essentially collapsing the channel lock.
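
For concreteness, here is a minimal sketch of the two orderings under discussion; the function names, the job slice, and the capacity are my own illustration rather than the app's actual code. The first variant lets every worker contend on the channel's lock during a burst; the second performs all blocking acquires on the spawning goroutine.

```go
package main

import "sync"

// Acquire the token inside each worker: during a burst, every worker
// goroutine hits the shared channel (and its internal lock) before any work.
func fanOutAcquireInside(sem chan struct{}, jobs []func()) {
	var wg sync.WaitGroup
	for _, job := range jobs {
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a token; slow path parks the goroutine
			defer func() { <-sem }() // release the token
			job()
		}(job)
	}
	wg.Wait()
}

// Acquire the token before the go statement: all blocking acquires happen on
// the one spawning goroutine, so only releases touch the lock concurrently.
func fanOutAcquireBefore(sem chan struct{}, jobs []func()) {
	var wg sync.WaitGroup
	for _, job := range jobs {
		sem <- struct{}{} // acquire on the parent goroutine, before spawning
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release when the worker finishes
			job()
		}(job)
	}
	wg.Wait()
}

func main() {
	sem := make(chan struct{}, 8) // capacity = the per-request concurrency limit
	jobs := make([]func(), 700)
	for i := range jobs {
		jobs[i] = func() {}
	}
	fanOutAcquireBefore(sem, jobs)
	fanOutAcquireInside(sem, jobs)
}
```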
> So, my hypothesis is that this workload is creating the goroutines before acquiring the tokens instead of after.

Right, the workload involves creating goroutines which each acquire their own token.
Beyond that, not only does the root request create goroutines, but those goroutines can create their own goroutines, which in turn can create more goroutines for generation after generation. In the end, the work is all joined up into a single response, but the fanout within the process is both large and multi-layered. And in the middle, all of the goroutines that are related to a root request share the per-request semaphore that attempts to limit their concurrency.
There isn't a single "before launching the workers" time, since work can involve creation of new workers behind the scenes, workers can create their own workers, etc.
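
Sketched very roughly (the type, sizes, and recursion below are assumptions for illustration, not the app's code), the shape is closer to this: every generation of workers shares the one per-request channel, and any worker may spawn further workers.

```go
package main

import "sync"

// request carries the per-request limiter shared by every goroutine that
// descends from the root request.
type request struct {
	sem chan struct{}
}

// run does one unit of work and may fan out into more workers, which may fan
// out again; every generation acquires tokens from the same shared channel.
func (r *request) run(depth, fanout int) {
	r.sem <- struct{}{} // acquire a token before this unit's work
	// ... do this unit's work ...
	<-r.sem // release the token before waiting on children

	if depth == 0 {
		return
	}
	var wg sync.WaitGroup
	for i := 0; i < fanout; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.run(depth-1, fanout)
		}()
	}
	wg.Wait()
}

func main() {
	r := &request{sem: make(chan struct{}, 8)}
	r.run(3, 5) // several generations of fanout, all joined into one response
}
```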
> which is much shorter than the 100μs per goroutine that you report

I wrote 100 µs, and I meant it in contrast to the slow 2+ ms waits. I should have written "less than 100 µs"; the longer-running goroutines are much easier to see in the `go tool trace` UI. By count, the majority of work units take only single-digit microseconds to run.

From piping `go tool trace`'s goroutine analysis page (for a trace covering 1 second of the app's execution) through `grep -A 1 ' </td>' | grep '<td>' | grep -o '[0-9]*.s' | sed -e 's/ns//' -e 's/µs/000/' -e 's/ms/000000/' | dist`, which leaves durations in nanoseconds:
```
N 262494  sum 7.38621e+09  mean 28138.6  gmean 11881.9  std dev 127828  variance 1.63401e+10
     min 939
   1%ile 1472
   5%ile 2304
  25%ile 6144
  median 9813
  75%ile 23000
  95%ile 77000
  99%ile 254000
     max 1e+07
```
I didn't appreciate the effect that over-estimating would have on your analysis. Sorry.
> essentially collapsing the channel lock

I accept that there'll be some limit on the number of operations per second that we can do on a single channel. This app creates around 300k goroutines per second, many of which interact with a channel like this, and it does so in bursts.
What surprises me is that when a single channel reaches that limit, 1/ the performance of those channel operations degrades significantly and swaps to a much slower mode of operation (essentially collapse), and 2/ when it does so, it consumes all of the Ps in the app and blocks forward progress by unrelated goroutines.
I think the collapsed channel consumes the Ps because:
(Compare #8899.)
What version of Go are you using (`go version`)?

go1.19.3 linux/amd64
Does this issue reproduce with the latest release?
Yes: go1.19.3 is the latest stable release. I have not tried with tip.
What operating system and processor architecture are you using (`go env`)?

linux/amd64
What did you do?
I have an app that serves HTTP requests and consumes other HTTP services as backends. The work required to process an inbound request is highly variable. Serving the largest requests involves rapid creation of several hundred goroutines, each of which does a bit of work and then allows other parts of the associated request to proceed. Smaller requests have a similar pattern, but may use only a few dozen goroutines.
Because request size is so variable, the program tries to limit the maximum concurrency that any single inbound request can consume, so smaller requests get a chance to use various HTTP clients and to have on-CPU time. It does this by creating a `chan struct{}` for each request, to use the channel's capacity as a semaphore. Before and after doing work for a request, a goroutine needs to interact with the channel to send or receive a permission token.

This app runs on machines that provide 48 hardware threads as a single NUMA node. It uses the default GOMAXPROCS setting, leading to 48 Ps.
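
As a rough illustration of that pattern (the helper type and its methods are hypothetical, not the app's actual API), the per-request semaphore looks something like this:

```go
package main

// limiter caps how many goroutines may do work for one inbound request at a
// time; the buffered channel's capacity is the concurrency limit.
type limiter chan struct{}

func newLimiter(n int) limiter { return make(limiter, n) }

func (l limiter) acquire() { l <- struct{}{} } // blocks once the request is at its limit
func (l limiter) release() { <-l }

func main() {
	perRequest := newLimiter(16) // one limiter per inbound request

	done := make(chan struct{})
	go func() {
		defer close(done)
		perRequest.acquire()       // before doing work for the request
		defer perRequest.release() // after the work finishes
		// ... unit of work ...
	}()
	<-done
}
```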
What did you expect to see?
I expected `chan struct{}` to be an effective way to build a semaphore. I expected the runtime to mediate access to the channel in a way that scales well to 48 threads. I expected the runtime to allow forward progress by goroutines that did not attempt to interact with an especially busy channel.

What did you see instead?
The app's work is bursty. When a large request arrives, there's a sudden increase in the amount of available work. Most units of work take around 100 µs to complete. During some bursts of work, the number of runnable goroutines grows to around 1000 and all Ps are able to cooperate in draining that queue.
But during other bursts of work, the execution tracer shows those goroutines taking more than a millisecond of on-CPU time when a P picks them up. The CPUSample events that arrive on goroutines that are taking an unusually long amount of time to do their usual work show that the P is busy trying to get the runtime-internal lock for one of the concurrency-limiting channels.
I think that part of what's going on is that a single goroutine launches a variable number of additional goroutines, and if they don't all fit in the P's local run queue (capacity of 256), they spill out onto the global run queue. Go's scheduler is good at spreading out that work, so if 700 goroutines land there and one of 48 Ps looks for work, it'll pick up about 14 of them and leave the rest for another P to find. Soon, every P in the program has a dozen goroutines in its local run queue, each of which needs to obtain the lock on the single channel they all share (because they were launched by the same goroutine as part of the same request) before doing any real work.
Goroutines related to other requests, whose concurrency-limiting channels are not contended (small / "mouse" requests), are stuck behind that elephant request's goroutines and cannot get running time on a P.
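
A stripped-down sketch of that burst shape (the sizes are illustrative, not a faithful reproducer): a single goroutine launches several hundred workers, all of which must pass through the same channel before their short unit of work.

```go
package main

import "sync"

func main() {
	sem := make(chan struct{}, 16) // per-request limit shared by every worker below

	var wg sync.WaitGroup
	for i := 0; i < 700; i++ { // a burst larger than one P's local run queue (256)
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // each worker contends on the same channel lock first
			// ... roughly 100 µs of real work would happen here ...
			<-sem
		}()
	}
	wg.Wait()
}
```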
On its surface, this looks like a fine way to write Go code. IIUC, it's common for an "introduction to Go concurrency" to say that goroutines are cheap, to suggest using a shared channel like this for controlling concurrency (if desired), and to advise against worker pools (which in this app's case, we could size to below GOMAXPROCS). Having 48 Ps is a bit big, but it's not even into the realm of NUMA these days.
CC @golang/runtime
Below is an execution trace that shows a few bursts of work that are large enough that no P is left asleep.
First, from 370–380 ms, the process earns 32 CPUSample events. The leaf-most function in those is `runtime.futex` in 10 samples, `runtime.procyield` in 4 samples, and 1 sample each in `runtime.chanrecv`, `runtime.lock2`, and `runtime.osyield`. The other 15 samples are in application code and runtime calls unrelated to locking. The execution trace shows relatively quick execution for each goroutine.

Second, from 380–390 ms, the process earns 17 CPUSample events. The leaf-most function in those is `runtime.chansend` in 1 sample and `runtime.sellock` (for a different type of channel) for 1 sample. The other 15 samples are again in non-lock-related code, and the execution trace shows quick execution.

Finally, the problem: From 390–400 ms, the process earns 40 CPUSample events. The leaf-most function in those is `runtime.futex` in 33 samples. The other 7 samples are unrelated to locking. The execution trace shows that when a goroutine runs, it uses the P for multiple milliseconds, much longer than the typical 100 µs.

The runtime portion of the CPUSample events looks like these. Note this runs on a Linux machine, so these events correspond to `timer_create` timers delivering a thread-targeted SIGPROF during the kernel's timer interrupt. They arrive in bursts, and the bursts are every 4 ms (CONFIG_HZ=250).
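
For reference, a trace with CPUSample events can be captured roughly like this (a sketch; the file names are arbitrary): in Go 1.19, running the CPU profiler and the execution tracer at the same time records the profiler's SIGPROF samples in the trace as CPUSample events.

```go
package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	profF, _ := os.Create("cpu.pprof") // error handling elided for brevity
	traceF, _ := os.Create("exec.trace")
	defer profF.Close()
	defer traceF.Close()

	// With both active at once (Go 1.19+), SIGPROF samples also appear in
	// the execution trace as CPUSample events.
	pprof.StartCPUProfile(profF)
	trace.Start(traceF)

	time.Sleep(time.Second) // cover roughly 1 second of the app's execution

	trace.Stop()
	pprof.StopCPUProfile()
}
```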