golang / go

The Go programming language
https://go.dev

runtime: findRunnableGCWorker consumes much higher CPU on a 64c AWS instance (ARM64) compared to 32c #68399

Open · innerr opened this issue 1 month ago

innerr commented 1 month ago

Go version

go1.19.13 linux/arm64

Output of go env in your module/workspace:

GOARCH="arm64"

What did you do?

What did you see happen?

Here are the graphs:

What did you expect to see?

The latency of this server process shouldn't change much when we move from a 32c instance to a 64c instance.

seankhliao commented 1 month ago

1.20 isn't a supported release, can we see the results from the latest version?

rhysh commented 1 month ago

Yes, reproducing on a newer Go version would make it easier to analyze the problem and perhaps to fix.

It's interesting to see runtime.(*gcControllerState).findRunnableGCWorker as the part of runtime.findRunnable that is consuming on-CPU time. That's guarded by a check of gcBlackenEnabled, so it implies that the program is in the GC Mark phase.

Maybe the GC is running very often, which in turn has other effects that cause the app's increased tail latency. In that case, arranging for the GC to run less often (without changes to Go) might resolve your application's problems. Or, looking more deeply into the tail latency itself might show something about Go. Execution traces (the runtime/trace package) give a more direct view into latency.


One of the behaviors I've seen in the scheduler is that when there are a lot of Ps (higher GOMAXPROCS, since that's the number of runtime.p values in the program) it takes longer for an individual thread to find the runnable goroutines. That may be offset by the increased number of threads which are looking! When I've seen it, it's shown up as more on-CPU time in runtime.findRunnable's calls to runtime.stealWork.

It's interesting to see that your application ends up with higher latency. As I understand Go's scheduler, running a program with higher GOMAXPROCS would lead to higher CPU usage due to idle threads searching for work rather than going to sleep. But in a single-tenant environment -- with no other (big) programs running on the machine -- I'd expect the effect of that to be limited to keeping the CPUs powered up for longer than necessary, to waste electricity but to still have good latency in the app.

You mentioned AWS EKS, which makes me wonder if the app is running alongside other Go apps on the same machine/VM. If so, the increase in on-CPU times in the other Go apps (from the effect I described above) could be what's causing the latency.


It looks like the flame graph tool is reordering the functions based on their contribution to the on-CPU time total (rather than remaining alphabetical), which makes the two charts a bit hard to compare. Especially with the names being truncated. Sharing the data in a less processed form may be helpful, maybe the svg/web/png view of go tool pprof with -show_from=^runtime\. and/or -focus=^runtime\.findRunnable$.

The -list output for findRunnableGCWorker could also narrow down why that function is on-CPU for so much time, though I wonder if it's mainly due to a large number of calls rather than any of those calls being slow.

Looking at growth of the /sched/latencies:seconds buckets in the runtime/metrics package (summed, and multiplied by 8 to offset the sample rate of runtime.gTrackingPeriod) can give a view into how often the scheduler needs to make decisions.
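A minimal sketch of reading that metric, assuming you sample it periodically and compare the growth between readings (the multiply-by-8 factor is the sample-rate offset mentioned above):

package main

import (
	"fmt"
	"runtime/metrics"
)

// schedLatencyEvents approximates how many scheduling-latency samples have
// been recorded so far, by summing the bucket counts of
// /sched/latencies:seconds and scaling by 8 to offset the sampling done via
// runtime.gTrackingPeriod.
func schedLatencyEvents() uint64 {
	s := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(s)
	hist := s[0].Value.Float64Histogram()

	var n uint64
	for _, c := range hist.Counts {
		n += c
	}
	return n * 8
}

func main() {
	fmt.Println("approx. scheduling-latency samples so far:", schedLatencyEvents())
}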

innerr commented 1 month ago

Sorry it should be go1.19.13, my bad, and thanks for the rapid response!

innerr commented 1 month ago

By the way, the figure below shows the effect on the server process; the latency is really bad at 64c. (Single process in the k8s pod, no other apps.) 20240713-135251

innerr commented 1 month ago

Here are the flame graphs: (*.proto) profile-2024-07-13_06-26-00.zip profile-2024-07-07_06-28-00.zip

rhysh commented 1 month ago

Sorry it should be go1.19.13, my bad, and thanks for the rapid response!

Thanks for sharing the data. Those profiles work well with go tool pprof, including to annotate the matching lines of runtime source code.

I see that the Build IDs for those two profiles are identical (e3ac319ed9aafff97d9ecf080dc2d206dd998382).

I see that the amount of on-CPU time they spend in several types of application goroutines is similar between the two profiles -- looks like they're receiving the same total amount of offered load (so half as much per core on the 64-core machine).

$ go tool pprof \
    -source_path=$HOME/sdk/go-bisect \
    -trim_path=/usr/local/go \
    -call_tree \
    -ignore='(runtime.schedule|runtime.gcDrain)' \
    -web \
    /path/to/profile.pb.gz

From spot-checking differences in on-CPU time, I see that the 64-core profile shows more time in runtime.markroots. It's hard to know how much to read into this (or what to take from it) ... the CPU profiles average the application's behavior for their time window (10 seconds, in this case), so maybe these are capturing different portions of some small number of GC cycles.

Another overall takeaway from the CPU profiles is that the 32-core profile shows an average of about 10 cores of on-CPU time, and the 64-core profile shows an average of about 17 cores of on-CPU time. These don't appear (on average at least) to be starved for CPU time.

single process in the k8s pod, no other apps

With no other pods scheduled on the machine?

The change in how the program spends its CPU time is suspicious, but I don't see how that could connect to a delay in processing application work. When many Ps are in runtime.(*gcControllerState).findRunnableGCWorker's call to runtime.(*lfstack).pop, I'd expect that 1/ at least one of them is able to make progress, and so move on to doing other work and 2/ that even if there's a particular P that ends up stuck in lfstack.pop indefinitely, that any Gs on its local run queue would be available for other Ps to steal and execute.

I do see a difference in the amount of GC Assist work: from go tool pprof -peek=gcDrainN, 0.18% of the total matches for the 32-core profile, versus 2.39% for the 64-core profile. All of that is in the critical path of application work (and may also be in the critical path of the application's latency-sensitive work).

There may be two or more problems here. One relating to GC Assist, which could be a data point for Go's pacer design or could be worked around within your app. Another relating to the high on-CPU time in lfstack.pop.

Some thoughts on how to make analysis easier: 1/ A more recent (supported) version of Go, such as the Go 1.22 series. It's possible the bug is fixed, or is different, and in any case obtaining a fix for the bug will likely require updating the Go toolchain. 2/ Capturing execution traces (as in the runtime/trace package), and if possible having the app capture a CPU profile at the same time (which will result in the execution trace data stream including timestamped CPU profile samples, attached to the thread that observed them). It sounds like the main problem for your application is one of latency. The execution trace is the tool for measuring latency, including detailed sub-microsecond views into the behavior of the individual goroutines that collaborate to serve each request.
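As a concrete starting point, here is a minimal sketch of capturing both at the same time with only the standard library (the file names and the capture window are placeholders); because the CPU profile is active for the whole trace window, the trace stream will contain the timestamped CPU profile samples mentioned above:

package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

// captureTraceAndProfile records an execution trace and a CPU profile over
// the same window, so the trace includes timestamped CPU-profile samples.
func captureTraceAndProfile(dur time.Duration) error {
	tf, err := os.Create("trace.out") // placeholder path
	if err != nil {
		return err
	}
	defer tf.Close()

	pf, err := os.Create("cpu.pprof") // placeholder path
	if err != nil {
		return err
	}
	defer pf.Close()

	// Start the CPU profile first so its samples cover the whole trace
	// window and end up embedded in the trace data stream.
	if err := pprof.StartCPUProfile(pf); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()

	if err := trace.Start(tf); err != nil {
		return err
	}
	defer trace.Stop()

	time.Sleep(dur) // the capture window
	return nil
}

func main() {
	if err := captureTraceAndProfile(5 * time.Second); err != nil {
		panic(err)
	}
}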

The golang.org/x/exp/trace.FlightRecorder may help to capture an execution trace that shows the problem. It requires Go 1.22.

CC @golang/runtime, and (pacer!? or at least GC worker management) @mknyszek

innerr commented 1 month ago

Appreciate the comprehensive response!

Since this issue happened in the production environment, we can't simply change the online app (switch to a go1.22 build). So we are trying to reproduce this issue in an offline environment, then use go1.22 to capture the necessary info.

If we are lucky, I should come back with newly captured data in a few days. Thanks again!

prattmic commented 1 month ago

findRunnableGCWorker is one of the first checks in schedule. I wonder if this program is entering the scheduler extremely frequently (either due to explicit Gosched calls, or just short-lived runnable goroutines). Then all of the Ms are piling up on gcBgMarkWorkerPool. A trace would help to see if frequent scheduling is the case.

However, if that is the case, it is surprising we don't see other parts of the scheduler, for when the GC is not running.

innerr commented 1 month ago

Hi all, we ran some tests, and I am back with some test results and captured data.

Tests we did:

Some things we found in the results:

And we captured flame graphs with go1.22:

rhysh commented 1 month ago

Thanks for the update. Are you able to share the execution traces you collected?

Looking inside the zip files for the 32c and 64c experiments: The ./goroutine files make it look like the 32c process had been running for 7 minutes at the time of the snapshot and the 64c process had been running for 11 minutes. The ./mutex file shows 557 seconds of runtime._LostContendedRuntimeLock for the 32c process and 3500 seconds for the 64c process. That's about 1.3 and 5.3 thread-seconds of contention per second of wall-clock time, respectively. That seems like a lot.
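For reference, a hedged sketch of how a mutex profile like that ./mutex file can be collected in-process (the sampling rate and output path are placeholders; an app that already serves net/http/pprof can fetch /debug/pprof/mutex instead):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Sample roughly 1 in 100 contention events (placeholder rate); set this
	// early so contention is recorded during the run you want to inspect.
	runtime.SetMutexProfileFraction(100)

	// ... run the workload ...

	// Later, dump the accumulated mutex profile. debug=0 selects the
	// compressed protobuf format that go tool pprof reads.
	f, err := os.Create("mutex.pprof") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}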

Using the CPU profiles in ./profile as a very rough estimate of which runtime-internal locks are contended via go tool pprof -peek '^runtime.lock($| )' -lines ./profile, I see a bit of sellock (to acquire channel locks inside a select statement, ok) but I see a lot in lines that reference sched.lock — and more in the 64c profile than with 32c.

This looks like evidence in favor of @prattmic 's suspicion that the program is entering the scheduler extremely frequently. That might be due to behavior of app-specific code, or due to behavior of a package the app uses (such as http2/grpc).

The 64c ./mutex profile shows 165e6 contention events taking 3500 seconds total, for an average of 21 µs per event. For 32c, it's 57e6 and 557, for an average of 9.8 µs per event. Those two processes also had a difference in the amount of sched.lock contention versus the amount of contention in sellock. It could be that the work within the locks' critical section is scaling poorly, or it could be that the work while holding sched.lock takes longer than the work done in selectgo (to implement select, following the contention in sellock). Or it could be that lock handoff is slower when more threads are at play. Or something else.

An execution trace, preferably one that was captured concurrently with an active CPU profile, would give a more complete view of what the app is doing and why it might be having trouble, rather than needing to piece bits together from the other views.

I have an idea for how to address the high CPU time of findRunnableGCWorker, but I'm not sure that's the cause of the latency. (Other processors should be able to pick up the runnable userspace goroutines.) I'll send a patch.

gopherbot commented 1 month ago

Change https://go.dev/cl/602477 mentions this issue: runtime: avoid futile mark worker acquisition

rhysh commented 1 month ago

I think that https://go.dev/cl/602477 will fix the headline issue of "findRunnableGCWorker using more CPU on larger machines". It's been merged into the development branch, so should be part of the Go 1.24 release in about 6 months. If you're able to run your workload with a patched version of Go, it would be helpful to know how it changes the behaviors you've reported.

I don't think it will solve the higher tail latencies you've seen on larger machines (though it is a step towards that). I think the reason you're reporting the increase in findRunnableGCWorker's CPU usage is that your app is 1/ running on a large machine, and 2/ interacting with the scheduler very, very often. The CPU usage of findRunnableGCWorker is the easiest to see (on CPU profiles) and perhaps to fix, but there'll be others. The various sources of contention on sched.lock are probably next. An execution trace (with concurrent CPU profiling) will help to put that behavior into context.

The app might benefit from app-level or library-level changes to reduce how often it interacts with the scheduler. But I think the scheduler will benefit from continued issue reporting here.

Thank you for reporting, and sharing the detailed data.

innerr commented 1 month ago

Thank you guys a lot! We are discussing this internally; I should be back in a few days.

nolouch commented 4 weeks ago

Hi @rhysh, Thank you for your investigation.

Here is the trace file for https://github.com/golang/go/issues/68399#issuecomment-2262140473, captured with go1.19 (the trace was captured during the yellow section of the test, corresponding to the first image of that comment): trace file link. If it is helpful or you need more data, I can provide more by reproducing it in my environment. Via go tool pprof -peek '^runtime.selectgo($| )' -lines ./profile, it shows that most of the select contention comes from the http2/grpc libraries:

----------------------------------------------------------+-------------
                                            10.21s 41.24% |   golang.org/x/net/http2.(*serverConn).writeFrameFromHandler /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:1137
                                             3.95s 15.95% |   golang.org/x/net/http2.(*serverConn).serve /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:940
                                                2s  8.08% |   golang.org/x/net/http2.(*serverConn).readFrames /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:821
                                             1.52s  6.14% |   golang.org/x/net/http2.(*serverConn).readFrames /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:826
                                             1.51s  6.10% |   golang.org/x/net/http2.(*serverConn).noteBodyReadFromHandler /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:2436
                                             1.48s  5.98% |   golang.org/x/net/http2.(*serverConn).writeDataFromHandler /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:1101
                                             1.33s  5.37% |   google.golang.org/grpc/internal/transport.(*serverHandlerTransport).do /Users/nolouch/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/handler_server.go:194
                                             1.09s  4.40% |   google.golang.org/grpc/internal/transport.(*serverHandlerTransport).runStream /Users/nolouch/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/handler_server.go:443
                                             0.77s  3.11% |   google.golang.org/grpc/internal/transport.(*recvBufferReader).read /Users/nolouch/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/transport.go:179
                                             0.45s  1.82% |   golang.org/x/net/http2.(*serverConn).writeHeaders /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:2402
                                             0.27s  1.09% |   golang.org/x/net/http2.(*serverConn).sendServeMsg /Users/nolouch/go/pkg/mod/golang.org/x/net@v0.17.0/http2/server.go:1036
                                             0.18s  0.73% |   google.golang.org/grpc/internal/transport.(*serverHandlerTransport).HandleStreams.func1 /Users/nolouch/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/handler_server.go:361
     0.06s 0.0071%  3.40%     24.76s  2.92%                | runtime.selectgo /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:230
                                            15.11s 61.03% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:39
                                             8.24s 33.28% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:33
                                             0.51s  2.06% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:37
                                             0.42s  1.70% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:36
                                             0.40s  1.62% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:35
                                             0.02s 0.081% |   runtime.sellock /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/select.go:42
----------------------------------------------------------+-------------

So it could be a common issue.

BTW, I also tried to reproduce it in another environment with the same methods, but the runtime shows an even higher cost (almost 5000% CPU with 64c). Here are the profiles: idc-arm64-debug-64c-go1.19.zip, idc-arm64-debug-64c-go1.22.zip.

rhysh commented 4 weeks ago

Thanks @nolouch .

I'm not so worried about selectgo. Contention there isn't necessarily a problem for the whole process, and it's generally under control of the application with potential for app-level fixes. See for example #20351 as a pattern where it shows up (a "done" / "shutdown" channel for a server), and some discussion of workarounds (app-level fanout). Or #57070, which also has app-level options (instead of creating goroutines that each need to interact with the channel, do the channel operation before creating the goroutine).
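As a hypothetical illustration of that second workaround (the semaphore channel and the handle function are made up for this sketch, not taken from the app), performing the channel operation before spawning keeps the many short-lived goroutines off the channel's internal lock:

package app

// tokens is a hypothetical semaphore channel limiting concurrency to 8;
// handle stands in for the per-item work.
var tokens = make(chan struct{}, 8)

func handle(i int) { /* ... */ }

// launchContended: every spawned goroutine performs the channel send itself,
// so all n goroutines contend on the channel's internal lock at once.
func launchContended(n int) {
	for i := 0; i < n; i++ {
		go func(i int) {
			tokens <- struct{}{} // acquire inside the new goroutine
			defer func() { <-tokens }()
			handle(i)
		}(i)
	}
}

// launchCalmer: do the channel operation before creating the goroutine, so
// only the launching goroutine touches the channel at spawn time.
func launchCalmer(n int) {
	for i := 0; i < n; i++ {
		tokens <- struct{}{} // acquire before spawning
		go func(i int) {
			defer func() { <-tokens }()
			handle(i)
		}(i)
	}
}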

The go1.22 profile you shared today from "another environment" shows a very high level of contention on sched.lock. One of the problems we have with runtime-internal locks today is that short critical sections can result in all of the waiting threads being/remaining/becoming awake and checking repeatedly if the lock is available, which slows down updates to the cache line that contains the mutex's state. Waiting threads make accessing the lock slower, which means more threads end up waiting, which means the lock gets even slower. It can be a pretty bad form of congestion collapse. I think this is a big problem, and also that we can fix it, which I'm tracking at #68578. I have a proof-of-concept on how at https://go-review.googlesource.com/c/go/+/601597/3 , which may be of interest to try if you're able (and willing) to run a patched version of Go in a test environment.

But, it also shows a reason for the contention on sched.lock. Many of the samples that end up in lock2/unlock2 go through goschedImpl, which means something in the app is calling runtime.Gosched. The CPU profile shows one sample in https://github.com/grpc/grpc-go/blob/v1.58.3/internal/transport/controlbuf.go#L588 .. maybe there are others in the app. The current implementation of Gosched requires taking the process-wide sched.lock so it can put the goroutine onto the global run queue. And removing the goroutine from that run queue involves taking the lock again. The behavior of highly contended runtime.mutex values isn't the best, the behavior of Gosched isn't the best, but the app (or, a library used by the app) is what's making the choice to call runtime.Gosched. So there's a clear workaround in the short term: don't call Gosched, especially not often, especially not on big machines.
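A purely illustrative sketch of that short-term workaround (the loop, the work type, and the yieldEvery knob are hypothetical, not from the app or from grpc-go):

package app

import "runtime"

type work struct{}

func (work) do() {}

// processLoop is a hypothetical hot loop that used to call runtime.Gosched on
// every iteration. As described above, each Gosched takes the process-wide
// sched.lock to put the goroutine on the global run queue (and the lock is
// involved again when it is taken back off), so yield rarely, or not at all.
func processLoop(items []work) {
	const yieldEvery = 1024 // hypothetical tuning knob
	for i, it := range items {
		it.do()
		if (i+1)%yieldEvery == 0 {
			runtime.Gosched() // previously: unconditional, every iteration
		}
	}
}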

Thanks for sharing the execution traces. They're pretty big, and aren't opening quickly on my machine. (They're in the Go 1.19 format ... I think that fast loading of large traces should be better with the v2 tracer that shipped with Go 1.22.) I'll give a more thorough try later. But if you have them from a Go 1.22+ version of the program, that would probably work even better.

rhysh commented 4 weeks ago

I moved trace-PROC64-GOGCdefault.gz to a machine with more RAM and was able to finish opening it. (It looks like go tool trace needed up to 51 GB RSS while opening that 600 MB file.)

Using go1.23rc2's go tool trace -d=1 command and trimming to the first 100,000,000 lines, I see 1.3 seconds of wall-clock time (grep '^M' | tail), 4 GC cycles (grep '^M' | grep Metric | grep '/gc/heap/goal:bytes'), and 1,900,000 transitions into the "Running" state (grep '^M' | grep -- '->Running' | wc -l).

If we were to average across GOMAXPROCS=64, that would be a transition into "Running" every 44 µs on each thread. But the threads aren't all running all the time, so the "Running" events are bunched together more closely while a thread is active. So a typical goroutine execution lasts for less than 44 µs. Spot-checking between 760 ms and 860 ms, I see many goroutines that run for single-digit microseconds. So yes, this is an app that interacts with the scheduler very frequently.

I don't see any CPU profile sample events in the execution trace data stream.

During some millisecond-scale time windows, such as between 809.5 ms and 810.5 ms (pictured below), I see many google.golang.org/grpc.(*Server).serveStreams.func1.1 goroutines spend a longer than usual amount of time in the "Running" state. This might be something like #57070 — that's a way that temporary contention on a runtime-internal lock can show up in an execution trace. The CPU profile samples would help to reveal what kind of work those goroutines were doing at that time (and how it might be different from the kind of work they do when they only need to be on-CPU for a few microseconds).

You're looking for something that slows the p99 down by a few milliseconds. This is something occasional and unusual that lasts on the order of a millisecond.

If you're having trouble getting a CPU profile and execution trace to run at the same time, you might try using https://pkg.go.dev/github.com/rhysh/autoprof . Configure its ArchiveOptions with both CPUProfileDuration and ExecutionTraceDuration ... 1 second is probably enough for the "GOGC default" case ... pass that to a single-use NewZipCollector, and call Run.

(Screenshot 2024-08-07 at 4:37:32 PM)

nolouch commented 3 weeks ago

@rhysh Here is the data with the Go patch (https://go-review.googlesource.com/c/go/+/601597/3) vs go1.22, profiled with autoprof.

                   P90    P99     P999    P99999
Go 1.22 64c        499us  1.91ms  7.98ms  30.2ms
Go 1.22 32c        629us  1.70ms  4.64ms  14.8ms
Go 1.24 patch 64c  469us  1.23ms  5.95ms  29.4ms
Go 1.24 patch 32c  484us  1.10ms  3.61ms  13.8ms

The patch shows some minor improvements, but it's unclear whether the improvement is due to the PR's changes or to other changes. Also, the difference in tail latency between 32c and 64c is still significant.

rhysh commented 3 weeks ago

That's great data, thank you! From the bit that I've been able to digest so far, the worst of it looks (at a high level) like a single behavior of your app is causing it to bump into a combination of a few suboptimal behaviors in the runtime.

My read of the autoprof data bundles is that your app is running with GOMAXPROCS of 32 or 64 and is serving a high volume of latency-sensitive requests (which involves frequent allocation), but has a live heap of only 100–150 MB. The GC is running every 120–205 ms, with the mark phase lasting around 15 ms. If there's nothing else running on those machines, there's probably 60 GB or more of idle RAM that the process could use to spread out the time between its garbage collections. I'd start slow, by inviting the runtime to use 10 GB of it for garbage. You can do that by running with GOMEMLIMIT=10GiB, or with something like var ballast = make([]byte, 10<<30) in a package scope. See https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap/ for a description of a similar app that got good results with that technique.
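A minimal sketch of the ballast variant (10 GiB is the figure from the suggestion above; running with GOMEMLIMIT=10GiB is the no-code-change alternative):

package main

// ballast is never written to, so on Linux it remains mostly untouched
// virtual address space rather than resident RAM, but it still counts toward
// the heap size that GOGC's proportional trigger is computed from, so GC
// cycles happen far less often.
var ballast = make([]byte, 10<<30)

func main() {
	// ... rest of the application ...
}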

Another option, strange as it sounds, is to call runtime.GC in a loop (with no delay between calls). Or, to have the loop delay until some app-specific code judges that the time is right to start a GC cycle (probably based on data from runtime/metrics, plus some app-specific tuning). I'd call this a "user-space GC pacer". (The runtime has a pacer that decides how soon to start the GC cycle, but for some reason its estimates are off for your app.) The goal of that would be to eliminate the CPU time in gcDrainN. If you can get the on-CPU time there down to zero, then it would be interesting to know how the tail latency changed. CC @mknyszek
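A hedged sketch of such a user-space pacer, using runtime.ReadMemStats as the simplest possible trigger signal (the threshold and polling interval are placeholders to tune; runtime/metrics would be cheaper, and this only makes sense paired with a very high GOGC or a memory limit so the built-in pacer doesn't fire first):

package main

import (
	"runtime"
	"time"
)

// runUserPacer forces a collection whenever the heap has grown past an
// app-chosen threshold, instead of leaving the start decision to the
// runtime's pacer. Run it in its own goroutine.
func runUserPacer(thresholdBytes uint64, poll time.Duration) {
	var ms runtime.MemStats
	for {
		runtime.ReadMemStats(&ms) // note: briefly stops the world
		if ms.HeapAlloc >= thresholdBytes {
			runtime.GC() // blocks until the forced cycle completes
		}
		time.Sleep(poll)
	}
}

func main() {
	go runUserPacer(1<<30, 10*time.Millisecond) // placeholder tuning
	// ... rest of the application ...
	select {}
}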


I didn't see the behaviors that showed up in the other profiles/traces, and I haven't looked into why. I'm concerned that there are several concurrent problems, and that if we try to eliminate only a single one at a time we might conclude that none of them—by itself—fixes your app's tail latency.

I'd expect the lfstack usage in findRunnableGCWorker to still be an issue (the go1.24-patch data is from a checkout of CL 601597, so doesn't include CL 602477). It still shows up, but far less than in the 730 dataset. And, I haven't seen many Ps all keep their goroutines for 1–2 ms at a time, like I saw in my last post. But that's from a non-exhaustive look at the execution traces.


From the runtime side, I see high levels of GC Assist work. That shows up as runtime.gcDrainN in the CPU profiles rather than runtime.gcDrain. Something about the app is making the runtime's GC pacer start the cycle "too late", so it feels a need to assign GC-related work to the application's own goroutines instead of being able to do everything in gcBgMarkWorker goroutines. See for example the GC cycle from 600–620 ms in the "debug-0810-go-1.24patch-64c" data set.

Then, I see that there's a lot of on-CPU time in the lfstack that's used by runtime.getempty, especially as called via gcDrainN. That manages GC workbuf values .. it's a different lfstack than the last one that gave trouble, but it's still an lfstack. Maybe that doesn't work so well when there are a huge number of cores (64), or on arm64, or on particular implementations of arm64. (Maybe a difference in whether LSE is available? I don't know.) It takes more CPU time in the 64c data than in the 32c data. I think @mknyszek is the closest to this code and design.


Slowness during each GC cycle, lasting up to 15 ms and occurring every 150 ms, could explain tail latency beyond the p90. Setting a max heap size of 10 GiB would make GC cycles happen about 100x less often. (I'm surprised that the GOGC=1000 experiment didn't work better.)

Incidentally, I saw that the autoprof data is going into a bytes.Buffer which needs to be grown on occasion. That buffer also ends up large relative to the heap, and each time it grows is a large allocation and a large copy. I have some code in https://pkg.go.dev/github.com/rhysh/autoprof/periodic to help even that out (the llbuf), but the API it has today isn't especially helpful. It might skew the data in the execution trace a bit, but it doesn't explain the tail latency when this code isn't running :)

I used to have tools (in https://pkg.go.dev/github.com/rhysh/go-tracing-toolbox) to extract examples of slow requests from the execution trace, but I haven't ported them to the new trace parsing API (or added grpc support). Maybe I'll change that over the next few weeks.

mknyszek commented 3 weeks ago

Sorry for the lack of reply here. I tried to read over all the comments, but it's possible I missed something. Apologies in advance if I end up restating something that was already mentioned above.

I think Rhys has done a great analysis; below I respond primarily to the GC assist questions, though I think a lot of this is relevant to most of the details of Rhys' most recent comment.

From the runtime side, I see high levels of GC Assist work. That shows up as runtime.gcDrainN in the CPU profiles rather than runtime.gcDrain. Something about the app is making the runtime's GC pacer start the cycle "too late", so it feels a need to assign GC-related work to the application's own goroutines instead of being able to do everything in gcBgMarkWorker goroutines. See for example the GC cycle from 600–620 ms in the "debug-0810-go-1.24patch-64c" data set.

An increase in assist work with increased GOMAXPROCS is interesting. Whether or not it's "too much" is hard to say without additional details. If it's at all possible to run your program under GODEBUG=gctrace=1,gcpacertrace=1, the output (which will go to STDERR) will help a lot in determining that.

Here are a couple theories as to why there's additional assist work:

(1) The GC measures its own performance. If we're hitting a scalability bottleneck in the GC (like below), then the GC will measure lower performance, causing it to be more conservative and force more assists to keep up. The GC should still be starting earlier to offset that, but it's possible you're running up against some of the internal limits as to how early a GC is allowed to start running. The GODEBUG setting I mentioned above will help to determine that.

(2) Does each 64-core instance of your application accept a bigger load? (Like, when you scaled up from 32 cores to 64, did you also start using half as many instances?) Also, what is the difference in memory footprint between the 32-core and 64-core versions? I ask these questions because one way you can end up in this situation is due to how GOGC works. If your GOGC setting is the default, 100, and memory footprint did not grow proportionally (2x), then with 64 cores and a 2x increase in load you could be allocating memory 2x faster. The GC is thus operating with proportionally less runway than it had before.

Let me try to illustrate:

32-core -> 2k QPS results in 500 MiB/s allocated. With a 100 MiB live heap and a 200 MiB total heap footprint -> 1 GC cycle every 200 ms.
64-core -> 4k QPS results in 1000 MiB/s allocated. With a 100 MiB live heap and a 200 MiB total heap footprint -> 1 GC cycle every 100 ms.

Since in this example the live heap is constant, each GC cycle is doing a constant amount of work. But now your GC overhead is 2x! The fix to this would be to increase GOGC to trade memory use for a return to your previous GC overheads (modulo whatever you're losing from scalability bottlenecks and the like). Rhys' suggestions about heap ballasts and calling runtime.GC are also possible fixes in this scenario, though IMO it would be nice to avoid them and fix the issue more directly. :) There's also GOMEMLIMIT which you can use to protect yourself against increases to GOGC causing problems when the heap size is spiky, leading to OOMs.
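A small sketch of that combination, using round numbers in the spirit of the illustration above (the specific values are examples only, not recommendations):

package main

import "runtime/debug"

func main() {
	// Trade memory for fewer GC cycles: with a ~100 MiB live heap, GOGC=1000
	// lets the heap grow to roughly 1.1 GiB between cycles instead of ~200 MiB.
	// Equivalent to setting the GOGC environment variable.
	debug.SetGCPercent(1000)

	// Safety net against a spiky heap causing OOMs at high GOGC: a soft limit
	// the GC works to stay under. Equivalent to GOMEMLIMIT=4GiB.
	debug.SetMemoryLimit(4 << 30)

	// ... rest of the application ...
}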

Honestly, I suspect there's probably a bit of both (1) and (2) going on. Also, I'd recommend moving to at least Go 1.20, since Go 1.19 had a somewhat more erratic pacer which could lead to more assists in more edge cases. Having a look at the extra STDERR output from your program when running with GODEBUG=gctrace=1,gcpacertrace=1 will help a lot.

Then, I see that there's a lot of on-CPU time in the lfstack that's used by runtime.getempty, especially as called via gcDrainN. That manages GC workbuf values .. it's a different lfstack than the last one that gave trouble, but it's still an lfstack. Maybe that doesn't work so well when there are a huge number of cores (64), or on arm64, or on particular implementations of arm64. (Maybe a difference in whether LSE is available? I don't know.) It takes more CPU time in the 64c data than in the 32c data. I think @mknyszek is the closest to this code and design.

There are definitely some lfstack bottlenecks that we're aware of. For example, gcDrain and gcDrainN both hammer on work.full == 0 (an lfstack) which gets written to and checked frequently. That will show up as cycles directly attributed to those two functions.

zhangfannie commented 1 week ago

Some time ago, we encountered a similar problem, and we also identified (*lfstack).pop as the bottleneck. We tried a random exponential backoff mitigation that might work well for Go GC and memory management in the presence of contention. This drastically reduced the CPU overhead of (*lfstack).pop, but it is just a prototype of the optimization.

func (head *lfstack) pop() unsafe.Pointer {
	var delay uint32 = 1
	for {
		old := atomic.Load64((*uint64)(head))
		if old == 0 {
			return nil
		}
		node := lfstackUnpack(old)
		next := atomic.Load64(&node.next)
		if atomic.Cas64((*uint64)(head), old, next) {
			return unsafe.Pointer(node)
		}
		// The CAS failed, so another thread raced us for the head node.
		// Back off before retrying, doubling the sleep each time, to take
		// pressure off the cache line holding the stack head.
		usleep(delay)
		delay *= 2
	}
}
nolouch commented 1 week ago

Sorry for the delay in my response. GOGC has already been tested, and while it provides some benefits as noted in https://github.com/golang/go/issues/68399#issuecomment-2262140473, it doesn't perform as well as using 32 cores. I will also try using a heap ballast, but in the production environment memory usage exceeds 10GB, unlike my test environment where it is only several hundred MB. The problem persists. I will simplify my test, run it again in a few days, and provide feedback. Thank you for your answers.