josharian opened this issue 7 years ago
Ah. I can fix the idle time by using a buffered channel. (I shouldn't have to, but it works.) The G/proc shuffling remains. New trace:
CL https://golang.org/cl/43110 mentions this issue.
Interesting, thanks for the traces.
What HW is this running on? I ask since the lower-level caches are probably shared, so switching HW threads isn't a big deal if they are on the same core.
That said the GC tracing is likely to do far more to smash the cache and TLB than moving goroutines between cores on the same chip.
It's running on my laptop (8-core 2.9 GHz Intel Core i7, macOS). Although presumably this happens on all hardware, not just mine.
If it's easy to hack together a patch (unsightly or not) that experiments with memorizing G locations, I'm happy to benchmark it. The concurrent compiler seems like a good test case for it--NumCPU long-running, memory-heavy calculations.
That said the GC tracing is likely to do far more to smash the cache and TLB than moving goroutines between cores on the same chip.
Perhaps so. Seems worth an experiment, if feasible.
@josharian Do you know why the buffered channel makes such a big difference? Is it because there's a lot of time wasted (for reasons not yet understood) when context-switching (unbuffered case), and there's less context switching needed when there's a buffered channel?
@griesemer I think that's exactly it--those gaps in the first trace are (I suspect) caused by waiting for the scheduler to coordinate handing over work to a new worker, which requires scheduling both the producer and the consumer. The buffered channel means that (to a first approximation) there's always work available for a worker without needing the scheduler, thus no gaps.
If I'm reading the goroutine line in the trace correctly, there are runnable goroutines that aren't being run during these gaps. If all the goroutines were blocked on some channel then none would be runnable. We need to understand this better regardless of whether the buffered-channel workaround solves the compiler's issue.
As far as adding CPU affinity to our goroutines I'm a bit nervous. Not only would it complicate the Go scheduler but more importantly it would increase interaction between the Go and the various OS schedulers we support. There are a lot of layers between a goroutine and the HW thread it runs on.
If I'm reading the goroutine line in the trace correctly, there are runnable goroutines that aren't being run during these gaps.
That's correct.
As far as adding CPU affinity to our goroutines I'm a bit nervous.
I'm not suggesting CPU affinity. (At least, I don't think I am.) I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW. That's it. But maybe that does involve breaking many layers of abstraction; if so, I understand your reluctance.
I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW.
We do that already, actually. STW leaves the per-P run queues in place. However, it does kick the currently running G to the end of the P's run queue (because it's using the regular preemption path), which is probably why it doesn't look like we have affinity in the execution traces.
That might be relatively easy to hack in a fix for, but it's not going to matter much because we definitely don't maintain P-to-M affinity over STW, so the goroutine will almost certainly be resumed on a different OS thread even if it's still on the same P. That's harder to fix.
Yes, goroutine shuffling during the GC period, combined with a NUMA architecture, will make things even worse.
I generated a trace of the concurrent compiler backend, and see some unfortunate things in it.
Screenshot:
On the left hand side of this screenshot is the beginning of concurrent backend compilation. It continues off-screen to the right. The disconcerting things I see are:
For reference, the concurrency (fan-out, etc.) is all contained in a single function, compileFunctions, in pgen.go: https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/pgen.go#L255
Apologies if this is a duplicate. I can split this into two issues if that would be helpful.
cc @aclements @RLH