rhysh opened this issue 8 years ago
/cc @aclements @RLH
A quick sanity check -- does this mean that the total number of goroutines on your 8-core box is about 800? (40 copies times 20 goroutines). What's the OS scheduling quantum? How many of those per-process goroutines are eligible to run?
My first, lightly-informed guess is that some of the "running" goroutines are instead waiting to be given a core by the kernel, and they're not waiting at a Go safe point, so the thread cannot proceed to a safe point until the kernel says it can run, and the GC cannot finish a phase until all the threads have proceeded to a safe point.
Here's the execution trace of a 77ms mark termination pause:
Proc 0 does mark termination 72ms after Proc 4 begins the phase, and 68ms after the other straggler (Proc 1) observes the phase. There's an additional 5ms delay between when Proc 0 finishes its mark termination work and when Proc 4 declares the phase complete.
@dr2chase Yes, there would be around 800 goroutines total on the machine.
From /proc/sys/kernel/sched_rr_timeslice_ms, the scheduling quantum appears to be 25ms.
Each instance of the program is generally idle, waiting for a record to come on stdin. When one arrives, it's processed by a single goroutine. That goroutine later hands the data off to another goroutine which does some more analysis. Each process usually has 0 running goroutines. Sometimes they'll have 1 running goroutine. Occasionally for bursts of around 100µs there'll be up to three goroutines running in parallel in a process.
The total CPU usage of all of the programs on that machine—including the 40 Go programs and the JVM process that feeds them data—is well below 8 cores when averaged over several seconds. If the threads are unable to execute, it's not from lack of CPU—at least in the average case. Are you asking if there might be sub-second bursts of high CPU demand?
I think the execution trace indicates that all goroutines/threads are at a safe point: there appears to be a "proc stop" event following each goroutine execution before the start of the mark termination phase. There may be a problem getting each P to be picked up by an M (with a core from the OS) in order to run the mark termination phase for that P, but it doesn't look to me like there are goroutines pausing short of a safepoint.
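For reference, traces like the ones discussed here can be captured in-process with runtime/trace and inspected with go tool trace; a minimal sketch, where the output file name and the sleep that stands in for the real workload are placeholders:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Stand-in for the real workload; GC phases, proc stop/start events,
	// and mark termination pauses show up in the trace while this runs.
	time.Sleep(2 * time.Second)
}
```

The resulting file can then be opened with `go tool trace trace.out`.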
I set /proc/sys/kernel/sched_rr_timeslice_ms to 1 on one machine, which doesn't seem to have had a significant impact on the GC pause durations:
Sweep termination pause distribution:
N 1000 sum 10702.4 mean 10.7024 gmean 0.591121 std dev 20.7411 variance 430.195
min 0.011
1%ile 0.014
5%ile 0.016
25%ile 0.022
median 0.355
75%ile 12
95%ile 52.65
99%ile 101.317
max 178
⠀⠀⠀⠀⠀⠀⢰⢲⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.180
⠀⠀⠀⠀⠀⠀⠌⠈⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠅⠀⠲⠒⠲⠶⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
0 50 100
Mark termination pause distribution:
N 1000 sum 6475.96 mean 6.47596 gmean 0.931998 std dev 13.2368 variance 175.213
min 0.04
1%ile 0.051
5%ile 0.06135
25%ile 0.1
median 0.835
75%ile 6.55833
95%ile 32
99%ile 65.6633
max 147
⠀⠀⠀⠀⠀⠀⢐⠢⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.303
⠀⠀⠀⠀⠀⠀⠌⠨⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠅⠈⠓⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
0 50
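For what it's worth, pause distributions can also be pulled from the runtime directly via debug.ReadGCStats. This is only a sketch: ReadGCStats reports the total STW pause per cycle rather than splitting sweep termination from mark termination the way the gctrace-derived numbers above do.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func main() {
	// With five entries, PauseQuantiles is filled with the minimum,
	// 25th, 50th, 75th percentile, and maximum pause times.
	stats := debug.GCStats{PauseQuantiles: make([]time.Duration, 5)}
	debug.ReadGCStats(&stats)
	fmt.Println("GC cycles:", stats.NumGC)
	fmt.Println("pause quantiles (min, 25%, 50%, 75%, max):", stats.PauseQuantiles)
}
```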
40 Go programs and a JVM competing for 8 HW threads managed by the OS could result in one or more of the threads backing GOMAXPROCS being starved. The fact that the starved thread is a GC thread isn't special. Go hasn't attempted to solve these types of co-tenancy problems; they are hard. The GC assumes that it is omnipotent and that GOMAXPROCS represents the number of HW threads at its disposal.
The one thing that might help is to reduce GOMAXPROCS to 2 or 4, so that when a GC does start it doesn't grab all 8 of the HW threads.
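For reference, the P count can be capped either with the GOMAXPROCS environment variable or at startup; a minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap the number of Ps this process will use; equivalent to starting
	// the process with the environment variable GOMAXPROCS=2.
	prev := runtime.GOMAXPROCS(2)
	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Printf("GOMAXPROCS lowered from %d to %d\n", prev, runtime.GOMAXPROCS(0))
}
```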
I've changed GOMAXPROCS to 2 and the tail latencies are now significantly better-controlled. Thanks @RLH!
The STW phases need to happen on each P, so why are GOMAXPROCS Ms required to do that? Without addressing co-tenancy in general, could the scheduler be adjusted so that once the bit of STW bookkeeping is done on a particular P, the M would release that P and attempt to grab a P that still needs to complete the phase? A change like that might allow more programs to meet the GC pause goals without requiring tuning of GOMAXPROCS.
Here are the distributions of the mark termination pauses with go version devel +1e3c57c Wed Nov 16 20:31:40 2016 +0000 (meaning that the loop preemption patch is not active):
GOMAXPROCS unset (defaulting to 8):
N 10000 sum 59861.4 mean 5.98614 gmean 0.911389 std dev 12.7627 variance 162.888
min 0.039
1%ile 0.054
5%ile 0.064
25%ile 0.1
median 0.81
75%ile 6.3
95%ile 29
99%ile 65
max 186
⠀⠀⠀⠀⠀⠀⠰⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.461
⠀⠀⠀⠀⠀⠀⡁⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠄⠓⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
0 50
GOMAXPROCS manually set to 2:
N 10000 sum 8406.86 mean 0.840686 gmean 0.0734456 std dev 3.39466 variance 11.5237
min 0.012
1%ile 0.018
5%ile 0.022
25%ile 0.032
median 0.04
75%ile 0.055
95%ile 5.3
99%ile 15.6633
max 90
⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 33.757
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
0 10 20
The problem isn't that the GC STW needs GOMAXPROCS HW threads to do its job. However, any critical path needs at least one, and the other Go programs and the JVM are conspiring with the OS to prevent that. Any program's critical path could be delayed; it just happens that we are talking about GC today. To make things worse, if one of the conspirators is doing a concurrent mark it will recruit up to GOMAXPROCS idle Ps, potentially using up all the HW threads the OS is managing. The OS simply notes the pressure on the HW thread resource, takes away a HW thread in the middle of the STW's critical path, and doesn't give it back for a long time.
The GOMAXPROCS=2 hack simply limits the HW threads the OS gives to any single Go program. Instead of 1 Go program being able to eat up all 8 HW threads, it now takes 4. The numbers you reported seem to help confirm this.
So the change to GOMAXPROCS makes the processes less noisy for the benefit of their neighbors, rather than make the processes individually better-able to run in a noisy environment.
It sounds then like there's nothing to be done for a while, at least for Go 1.8. Should this issue be closed, or postponed/rolled into future co-tenancy work?
Thanks for helping me understand this behavior.
The less noisy metaphor is a good one and it is what I believe is going on. Co-tenancy work is interesting stuff for the future but for now I think we should close this issue.
On the other hand, the Go runtime is becoming a noisier neighbor: the GC is getting better and better at completing quickly. Commit 0bae74e8c9b5fab3baf61bde0169f4aa5e287bdc (for #14179) landed after I took this data, but looks like it will further those efforts. Since GOMAXPROCS defaults to NumCPU, even a mostly-idle daemon can create significant pressure on HW threads. This makes the host machine a noisier place for all programs.
Can Go be a less noisy neighbor by default? Is it worth addressing in Go 1.8 or 1.9?
To be specific: why is it beneficial to complete GC cycles quickly?
- Write barrier overhead is measurable but low, and my understanding of the plans for ROC is that the barrier would need to be enabled all of the time.
- The known latency bugs due to weird assist behavior are nearly solved (#14812 remains outstanding), so there's less risk to having mutator assists enabled for a longer time.
- A faster GC cycle will result in less floating garbage, but the volume of new floating garbage is bounded by the mutator assists.
As the behavior and performance of the GC improves, what is the effect that the idle mark workers have on the performance of a single Go program? They seem to have a negative effect on neighboring programs, at least in aggregate.
/cc @RLH @aclements
The problem is that the OS scheduler has no way to discuss with the Go scheduler how best to use limited CPU resources. Both think they are omnipotent. Leaving CPUs idle on the off chance there are co-tenants does not seem like a good idea.
You have 40 Go programs running on 8 cores simultaneously. Why? Why not 5 with GOMAXPROCS=1? Or 10 with GOMAXPROCS=2? Perhaps the runtime/scheduler issues that lead to 40 Go programs running simultaneously are an easier problem to solve.
Would GC handling SIGXCPU help in this case?
The issue isn't that a Go process is exceeding some CPU limit and a SIGXCPU from the OS might help. The issue is that for a brief time 1 or 2 of the 40 Go processes become greedy and use all of the available CPU resources. If this happens at the wrong time it starves other Go processes that the greedy Go runtime/scheduler knows nothing about.
Leaving CPUs idle on the off chance there are co-tenants does not seem like a good idea.
I don't understand the goals of the GC idle mark workers. They lead to less wall clock time for the GC cycle, but why is that beneficial? The dedicated mark workers ensure that the cycle will complete in a timely manner, and the assist mechanism ensures that a fast-allocating mutator won't blow out the heap goal. The idle mark workers seem like they're to support Quality of Implementation goals, but I don't understand the full landscape of tradeoffs there.
In this application, the Ps were idle before the GC cycle began. Activating them on the off chance that the program is alone on the host also doesn't seem like a good idea.
Perhaps the runtime/scheduler issues that lead to 40 Go programs running simultaneously are easier problem to solve.
Are you asking if I'm running 40 Go programs to work around runtime or scheduler issues? I'm not—these programs are part of a live data processing pipeline, and each handles a single shard of data. The processes are limited by the rate that data arrives on their assigned shard rather than by the available CPU, so we can fit many of these processes (40 or more) on a single 8-core server before we're even close to running out of CPU (in the average case). I could change how the application is structured, but having an independent process for each shard on the server (~40) is the most convenient architecture.
These 40 Go processes are for my application, but they're not the only processes running on the host. The host also runs a few dozen daemons: cron, syslog, process supervisors, system monitors, etc. Some are written in C or Python, some are written in Go. When untuned, the Go daemons are the only ones that can so easily cause significant pressure on available HW threads. (The non-Go daemons may have other problems, but I'm less aware of them.)
This seems like a tradeoff against how quickly the GC cycle should complete. This is one cost of the GC idle mark workers; what is their benefit?
I think what @RLH suggested is that you limit GOMAXPROCS for some of the ~40 Go processes.
I limited GOMAXPROCS for all of the ~40 Go processes. My concern is that this implies that having untuned mostly-idle Go programs on a machine has a cost on tail latency of other programs that run on the machine: Go's default behavior is to be a noisy neighbor.
I ran a comparison against go1.7.4 and it doesn't look like Go tip has become any noisier since that release.
If we agree that this means Go defaults to being a noisy neighbor, I'd be interested to learn more about why Go chooses to be noisy. @RLH indicated that the noisiness comes from the GC putting every idle P to work to reduce the wall clock time of the GC cycle. My question on that is, "why is it beneficial to reduce the wall clock time of the GC cycle?".
Too late for 1.8, but I'll try to put together an experimental fix.
See also discussion of what appears to be a similar issue at #20141: When testing the test dir, all.bash spins up 5*NumCPU processes, each at full GOMAXPROCS, which makes my laptop sluggish and unresponsive.
@aclements what was the nature of your experimental fix? Just curious.
CL https://golang.org/cl/42431 mentions this issue.
@aclements what was the nature of your experimental fix? Just curious.
The plan, which I haven't implemented, was to lower the OS scheduling priority of the threads in idle GC workers (and raise it back up when leaving the idle worker). You need consensus between the workers to terminate a GC cycle, so lowering it all the way to idle priority would probably lead to nasty priority inversion issues, but it's probably safe to lower it somewhat.
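A rough sketch of the thread-priority idea, using setpriority from golang.org/x/sys/unix. The nice values are assumptions, and this lives in user code only for illustration; an actual fix would adjust the thread running the idle GC worker from inside the runtime.

```go
//go:build linux

package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// withLoweredPriority runs f on the calling goroutine's OS thread after
// raising that thread's nice value, then restores it. On Linux, nice is
// per-thread, and PRIO_PROCESS with who==0 targets the calling thread.
// This assumes the thread's normal nice value is 0.
func withLoweredPriority(nice int, f func()) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := unix.Setpriority(unix.PRIO_PROCESS, 0, nice); err != nil {
		return err
	}
	f() // the idle GC worker's work loop would run here
	return unix.Setpriority(unix.PRIO_PROCESS, 0, 0)
}

func main() {
	_ = withLoweredPriority(10, func() {
		// background work that should lose to foreground threads under contention
	})
}
```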
I arrived here via #20457. Like @rhysh, I'm wondering if the benefit of idle GC workers is worth it. (The benefit being shorter GC cycles when a process is not using all of the CPU resources). It seems unintuitive for the GC to use more CPU when a process is relatively idle.
I don't understand the goals of the GC idle mark workers. They lead to less wall clock time for the GC cycle, but why is that beneficial?
Sorry I missed this question before.
You're right that the whole reason for idle mark workers is to shorten the wall-clock time of the GC cycle. The reason this is important is two-fold: 1) when GC is active, the write barrier is active, which slows down the mutator and 2) everything allocated during GC is retained by that cycle even if it's no longer reachable at the end, so the longer the cycle lasts, the more floating garbage is retained.
The plan, which I haven't implemented, was to lower the OS scheduling priority of the threads in idle GC workers (and raise it back up when leaving the idle worker).
I was thinking about this a bit more. A possibly more robust way to do this would be to use a separate canary thread that runs at idle priority (e.g., SCHED_IDLE on Linux) and only serves to signal the runtime that there's CPU available. This signal would start up an idle worker (running at normal priority so it's more likely to make progress). The idle worker would yield its thread quickly (e.g., after a few milliseconds). If this freed up the CPU, the canary thread would get scheduled again, and we'd start the idle worker up again.
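A sketch of the canary idea in user code, with a high nice value standing in for SCHED_IDLE (which would require sched_setscheduler(2)); the channel that would wake an idle worker is a placeholder:

```go
//go:build linux

package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

// canary pins a goroutine to a deprioritized OS thread. Whenever the kernel
// chooses to run this thread despite its low priority, spare CPU probably
// exists, so it pokes the channel that would start an idle GC worker.
func canary(spare chan<- struct{}) {
	runtime.LockOSThread()
	_ = unix.Setpriority(unix.PRIO_PROCESS, 0, 19) // lowest-priority nice value
	for {
		select {
		case spare <- struct{}{}:
		default:
		}
		// Yield; this loop only comes back around when the kernel has
		// nothing better to schedule on this CPU.
		time.Sleep(time.Millisecond)
	}
}

func main() {
	spare := make(chan struct{}, 1)
	go canary(spare)
	<-spare
	fmt.Println("canary ran: CPU appears idle, an idle worker could start")
}
```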
What version of Go are you using (go version)?
go version devel +1e3c57c Wed Nov 16 20:31:40 2016 +0000, with CL 33093 PS 4 cherry-picked on top, with GOEXPERIMENT=preemptibleloops
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
I have a program that reads ~1MB records from stdin, decodes the records, and sends UDP datagrams. The program runs with around 20 goroutines. Around 40 copies of the program run on an 8-core host.
What did you expect to see?
I expected the 95th percentile of mark termination and sweep termination pauses to be 100µs or less.
What did you see instead?
The program's 95th percentile sweep termination time is around 60ms, and 95th percentile mark termination time is around 30ms. Here's an example gctrace line:
The mark termination pauses are very easy to identify in the execution trace—if this bug needs to be about either sweep term pauses or mark term pauses, let's use it for mark term.
Sweep termination pause distribution:
Mark termination pause distribution:
And here's the execution trace of a slow mark termination pause, taking over 16ms. Proc 6 organizes the mark termination phase. Procs 1, 2, 3, 4, and 5 execute it quickly at around t=1007ms. Proc 0 does mark termination around t=1014ms, and Proc 7 delays until t=1024ms at which point the global pause concludes.