@rsc - it seems you're still using my cluster; do you still need it? I would also like to use it. Also, do you have any new findings? Thanks!
Hi @wojtek-t, I do not still need it. Feel free to take down the jobs I have left (should be just a stop-kubemark.sh away from being clean, and I will not start anything new). My quota came through a few days ago.
I have a lot of new data from the runs in your cluster that I haven't had time to go through properly. My son got some nasty virus over the weekend and I lost a lot of this week to staying home to take care of him. But I haven't forgotten!
I sent pull request kubernetes/kubernetes#23210 to clean up one use of time.After that sits on a particularly hot path exercised during the e2e test. See the commit description for details.
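For context, here is a minimal sketch of the general shape such a fix takes (the function name waitEvent is invented for illustration; this is not the actual diff in kubernetes/kubernetes#23210). The naive form with case <-time.After(d) allocates a new timer on every call and keeps it live until d elapses even when the event arrives first, which adds up on a hot path:

```go
package main

import (
	"fmt"
	"time"
)

// waitEvent receives one value from events, giving up after d. Creating the
// timer explicitly and stopping it releases it as soon as the select returns,
// instead of leaving a time.After timer live until d elapses.
func waitEvent(events <-chan string, d time.Duration) (string, bool) {
	t := time.NewTimer(d)
	defer t.Stop() // release the timer promptly if the event wins the race
	select {
	case e := <-events:
		return e, true
	case <-t.C:
		return "", false
	}
}

func main() {
	ch := make(chan string, 1)
	ch <- "ready"
	if e, ok := waitEvent(ch, 50*time.Millisecond); ok {
		fmt.Println("got:", e)
	}
}
```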
Times for the kubemark e2e test using Go 1.4.2 and Go 1.6.0, both before and after the timer fix:
p50: 148±7 p90: 328±19 p99: 513±29 n: 10 Go 1.4.2
p50: 141±9 p90: 383±32 p99: 604±44 n: 11 Go 1.6.0
p50: 151±8 p90: 339±19 p99: 479±20 n: 9 Go 1.4.2 after timer fix
p50: 140±14 p90: 360±31 p99: 483±39 n: 10 Go 1.6.0 after timer fix
I believe that with the timer fix applied, Go 1.6.0 is safe for Kubernetes to merge. As mentioned in the PR commit message, I also intend to make Go 1.7 not require that kind of fix, so that all the other uses of time.After will become asymptotically more efficient automatically. This is #8898. Once #8898 is submitted, I am OK with recognizing case <-time.After(d) in the compiler (#8895), which will make things a constant factor more efficient.
@dvyukov has also been investigating some other possible causes of high 99th percentile latency as part of investigating #14790. He has a number of CLs pending as I write this, and they're linked from that issue.
One final note, for @wojtek-t and @gmarek, is that I think Kubernetes may be significantly mismeasuring its server latency in this benchmark, at least with Go 1.4. A 'LIST nodes' RPC comes in about every 5 seconds, and a GC in Go 1.4 takes about 100ms. That means there's about a 1 in 50 chance of the RPC arriving during the GC. If that happens, the RPC sits in a kernel receive buffer while the GC finishes; only once the GC is done does the apiserver read the request and start its timer. The reported latency therefore ignores the time spent waiting for the GC that was going on. Still using very rough numbers, this will affect about 2% of requests, and it will add half a GC pause to each on average. Without knowing the precise shape of the underlying distribution it's hard to estimate the effect on 99th percentile latency, but certainly there is an effect. The reported 99th percentile latency will be lower, perhaps significantly lower, than the actual 99th percentile latency.
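To make that accounting gap concrete, here is a minimal sketch (not Kubernetes' actual instrumentation) of server-side latency measurement in Go; the point is simply where the timer is able to start:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// latencyHandler wraps h and logs how long each request took. Note where the
// clock starts: inside ServeHTTP, which only runs after the server has read
// and parsed the request. Any time the request spent sitting in a kernel
// receive buffer beforehand (for example, while a GC pause kept the process
// from reading the socket) is invisible to this measurement.
func latencyHandler(h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now() // timer starts only after the request was read
		h.ServeHTTP(w, r)
		log.Printf("%s %s took %v", r.Method, r.URL.Path, time.Since(start))
	})
}

func main() {
	http.Handle("/", latencyHandler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```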
Go 1.5 introduced a concurrent garbage collector, so that the actual stop-the-world pauses, during which the apiserver ignores incoming network traffic, are much smaller. The apiserver should be able to read the request and start the timer earlier in the cases just described, although since a GC is still in progress taking up some of the CPU, the overall time spent on the request will be longer. So part of the reported latency increase from Go 1.4 to Go 1.5 may well be real problems introduced by Go 1.5, but part of the reported latency increase is from reporting something closer to the actual latency.
I think it's encouraging that, after the timer fix, moving from Go 1.4 to Go 1.6 reports basically no latency increase. Since the more accurate reporting would be expected to push the reported latencies up, yet they are level, that suggests actual latencies went down from Go 1.4 to Go 1.6, a win for users if not for the SLO/SLA measurement.
I'm hopeful that the combination of Dmitry's scheduler fixes, the timer channel fixes, and @RLH and @aclements's overall GC throughput fixes will make the kubemark latency even lower in Go 1.7. Now that we can run the benchmark, I intend to make sure we test that theory.
Once these fixes are all in and I've tested with current Go 1.7 master (on my own cluster, not wojtek-t's), I will close this issue.
P.S. The ± above are standard deviation, and all the times are 'LIST nodes' apiserver-reported latency in ms.
If we are talking about hundreds of milliseconds, I don't think my pending changes will provide significant improvement. But who knows.
As mentioned in the PR commit message, I also intend to make Go 1.7 not require that kind of fix, so that all the other uses of time.After will become asymptotically more efficient automatically.
Finally!
@rsc - thanks for the awesome work! If you have any suggestions on how to make time measurement better, we're happy to hear them. IIUC this problem impacts pretty much every
start := time.Now()
...
latency := time.Now().Subtract(start)
pattern measuring wall time. It is a common pattern in Kubernetes (and I guess in quite a few other projects).
On the topic of accurately timing things, there is also the fact that time.Now() doesn't use a monotonic clock source (#12914). So because time.Now() is based on [__vdso_]clock_gettime(CLOCK_REALTIME, ...) [on Linux at least], when the time changes (either due to administratively changing the system's time, or due to NTP making a big adjustment, or due to leap seconds) the kinds of subtractions shown above will get affected in funny ways (including potentially returning negative values), and when NTP disciplines the clock, time as measured by time.Now() doesn't actually pass at a constant rate.
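For anyone who needs a clock immune to those adjustments, here is a minimal sketch of reading CLOCK_MONOTONIC directly on Linux via golang.org/x/sys/unix. This is one possible workaround, not something this thread settled on; #12914 was later addressed in Go 1.9, which made time.Now carry a monotonic reading that time.Since uses.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// monotonicNow reads CLOCK_MONOTONIC, which only moves forward and is not
// affected by administrative clock changes, NTP steps, or leap seconds.
func monotonicNow() (int64, error) {
	var ts unix.Timespec
	if err := unix.ClockGettime(unix.CLOCK_MONOTONIC, &ts); err != nil {
		return 0, err
	}
	return ts.Nano(), nil // nanoseconds since an arbitrary fixed point
}

func main() {
	start, _ := monotonicNow()
	// ... the work being timed ...
	end, _ := monotonicNow()
	fmt.Printf("elapsed: %d ns\n", end-start)
}
```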
The problem is not the use of time.Now and time.Since, nor the particular kind of clock. (Btw @gmarek, time.Since reads better than time.Now().Subtract.)
The problem is that the timer is only started once the server has read the request off the network. Any delay before the request is read (for example, if the request arrives during a GC pause) is therefore not accounted for. The GC pauses that prevent the program from reading from the network and are therefore unaccounted for are larger in Go 1.4 than in Go 1.5. This means that Go 1.5+ is probably giving a more accurate accounting of RPC latency than Go 1.4 did. At least part of the observed latency increase is due to this better accounting.
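A small aside on that parenthetical: the method on time.Time is actually Sub, not Subtract, so the snippet quoted above would not compile as written; time.Since performs the same computation. A minimal example:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()
	time.Sleep(10 * time.Millisecond) // stand-in for the work being measured
	latency := time.Since(start)      // equivalent to time.Now().Sub(start)
	fmt.Println("latency:", latency)
}
```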
I fully understand what you're saying @rsc, just pointing out another issue to keep in mind when one wants to accurately measure things.
For measuring the amount of time spent in kernel buffers, this was actually an issue in Google's internal RPC stack that was fixed back around 2009 or so; I forget if it was by using SIOCGSTAMP or SO_TIMESTAMP (although ISTR that by default only UDP was supported and we had to change the kernel to also support this on TCP streams).
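For reference, a rough sketch of opting a Go TCP connection into kernel receive timestamps, assuming Linux and the golang.org/x/sys/unix package (the package name rxstamp and the function enableRxTimestamps are invented for illustration). Whether a given kernel actually reports these timestamps for TCP streams is a separate question, as noted above, and reading the timestamps back requires recvmsg with ancillary data, which is omitted here.

```go
package rxstamp

import (
	"net"

	"golang.org/x/sys/unix"
)

// enableRxTimestamps asks the kernel to attach an arrival timestamp
// (SO_TIMESTAMP) to data received on conn. The timestamp is delivered as a
// control message alongside the data rather than through the normal Read path.
func enableRxTimestamps(conn *net.TCPConn) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_TIMESTAMP, 1)
	}); err != nil {
		return err
	}
	return sockErr
}
```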
CL https://golang.org/cl/21503 mentions this issue.
Kubernetes is on Go 1.6 as of April 5 (see kubernetes/kubernetes#20656).
Yeah - with 1.6 we didn't see any issues. Thanks a lot for the help!
The Kubernetes team found that updating from Go 1.4.2 to Go 1.4.3 caused a performance regression for them, and that Go 1.5.3 showed problems as well.
The 1.4.3 bug is kubernetes/kubernetes#17524, "fixed" by reverting to Go 1.4.2. The error report at the top of the bug is difficult for me to interpret: to me it says mainly that 1 != 0 and 3 != 0. A comment on the next bug says “... but we can't do what InfluxDB did, because we have some pretty massive issues on 1.4.3. 1.4.2 was the last even vaguely happy point.” Right now, I don't have any way to quantify the 1.4.2 to 1.4.3 problem other than that remark. (The InfluxDB note is referring to #14189, but we don't know that they are necessarily related.)
The 1.5.3 bug is kubernetes/kubernetes#20400, "fixed" by reverting to Go 1.4.2 again. That bug has clearer data: on a particular load test, tail latencies are significantly higher in 1.5.3.
The Kubernetes team created kubernetes/kubernetes#20656 to track further investigation of these regressions. This is the corresponding issue on the Go side.