josharian opened this issue 7 years ago
Ah. I can fix the idle time by using a buffered channel. (I shouldn't have to, but it works.) The G/proc shuffling remains. New trace:
CL https://golang.org/cl/43110 mentions this issue.
Interesting, thanks for the traces.
What HW is this running on? I ask since the lower-level caches are probably shared, so switching HW threads isn't a big deal if they are on the same core.
That said the GC tracing is likely to do far more to smash the cache and TLB than moving goroutines between cores on the same chip.
It's running on my laptop (8-core 2.9 GHz Intel Core i7, macOS). Although presumably this happens on all hardware, not just mine.
If it's easy to hack together a patch (unsightly or not) that experiments with memorizing G locations, I'm happy to benchmark it. The concurrent compiler seems like a good test case for it--NumCPU long-running, memory-heavy calculations.
That said the GC tracing is likely to do far more to smash the cache and TLB than moving goroutines between cores on the same chip.
Perhaps so. Seems worth an experiment, if feasible.
@josharian Do you know why the buffered channel makes such a big difference? Is it because there's a lot of time wasted (for reasons not yet understood) when context-switching (unbuffered case), and there's less context switching needed when there's a buffered channel?
@griesemer I think that's exactly it--those gaps in the first trace are (I suspect) caused by waiting for the scheduler to coordinate handing over work to a new worker, which requires scheduling both the producer and the consumer. The buffered channel means that (to a first approximation) there's always work available for a worker without needing the scheduler, thus no gaps.
If I'm reading the goroutine line in the trace correctly, there are runnable goroutines that aren't being run during these gaps. If all the goroutines were blocked on some channel then none would be runnable. We need to understand this better regardless of whether the buffered-channel workaround solves the compiler's issue.
As far as adding CPU affinity to our goroutines I'm a bit nervous. Not only would it complicate the Go scheduler but more importantly it would increase interaction between the Go and the various OS schedulers we support. There are a lot of layers between a goroutine and the HW thread it runs on.
If I'm reading the goroutine line in the trace correctly, there are runnable goroutines that aren't being run during these gaps.
That's correct.
As far as adding CPU affinity to our goroutines I'm a bit nervous.
I'm not suggesting CPU affinity. (At least, I don't think I am.) I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW. That's it. But maybe that does involve breaking many layers of abstraction; if so, I understand your reluctance.
I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW.
We do that already, actually. STW leaves the per-P run queues in place. However, it does kick the currently running G to the end of the P's run queue (because it's using the regular preemption path), which is probably why it doesn't look like we have affinity in the execution traces.
That might be relatively easy to hack in a fix for, but it's not going to matter much because we definitely don't maintain P-to-M affinity over STW, so the goroutine will almost certainly be resumed on a different OS thread even if it's still on the same P. That's harder to fix.
Yes, goroutine shuffling during the GC period, combined with a NUMA architecture, will make things even worse.
I generated a trace of the concurrent compiler backend, and see some unfortunate things in it.
Screenshot:
On the left hand side of this screenshot is the beginning of concurrent backend compilation. It continues off-screen to the right. The disconcerting things I see are:
For reference, the concurrency (fan-out, etc.) is all contained in a single function, compileFunctions, in pgen.go: https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/pgen.go#L255
Apologies if this is a duplicate. I can split this into two issues if that would be helpful.
cc @aclements @RLH