golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.15k stars 17.46k forks source link

net: *TCPConn deadlines missed by more than 25% on NetBSD and OpenBSD #50189

Open bcmills opened 2 years ago

bcmills commented 2 years ago

In CL 366176 (for #36108), I increased the timeout slop in net.TestWriteTimeoutFluctuation and net.TestReadTimeoutFluctuation to 33% even for tests with very generous (multiple-second) timeouts. At that scale, that margin of slop should be trivial for even a heavily-loaded builder to hit.

Unfortunately, the NetBSD and OpenBSD builders still do not reliably hit it, even on an n1 instance that does not appear to be affected by #49209.

Given the other issues we've had with NetBSD and OpenBSD, I suspect a kernel bug. I plan to further raise the slop for those two platforms, narrow it everywhere else, and call it at that without investigating further. However, I suggest that folks who care about these kernels (@bsiegert, @coypoop, @4a6f656c?) may want to look into whether the underlying system calls may be adding unnecessary slop in their timeout handling.

greplogs --dashboard -md -l -e 'FAIL: TestWriteTimeoutFluctuation' --since=2021-11-23

2021-12-13T16:09:22-d198a36/openbsd-amd64-70-n2d 2021-11-29T19:45:58-f598e29/netbsd-amd64-9_0-n1 2021-11-29T16:08:23-a59ab29/openbsd-amd64-68 2021-11-25T00:07:28-f7e34e7/openbsd-amd64-70-n1 2021-11-25T00:07:11-c58243a/openbsd-amd64-70-n1

gopherbot commented 2 years ago

Change https://golang.org/cl/372215 mentions this issue: net: increase timing slop in TimeoutFluctuation tests on NetBSD and OpenBSD

bcmills commented 2 years ago

Expanding the regexp a bit to catch related tests:

greplogs --dashboard -md -l -e 'FAIL: Test(?:Read|ReadFrom|Write)TimeoutFluctuation' --since=2021-11-19

2021-12-13T16:09:22-d198a36/openbsd-amd64-70-n2d 2021-12-03T14:28:11-a174638/netbsd-386-9_0-n2d 2021-12-02T05:25:23-00dbcb3/netbsd-386-9_0 2021-12-01T22:20:39-c3a7fb2/openbsd-amd64-70 2021-11-30T20:04:58-7ccbcc9/openbsd-amd64-70-n1 2021-11-29T22:02:51-ebd0b77/openbsd-amd64-70 2021-11-29T22:02:45-1970e3e/openbsd-amd64-70 2021-11-29T19:45:58-f598e29/netbsd-amd64-9_0-n1 2021-11-29T16:08:23-a59ab29/openbsd-amd64-68 2021-11-27T23:27:52-9f2a075/openbsd-amd64-70-n1 2021-11-27T19:49:32-a142d65/openbsd-amd64-70-n1 2021-11-25T00:07:28-f7e34e7/openbsd-amd64-70-n1 2021-11-25T00:07:11-c58243a/openbsd-amd64-70-n1 2021-11-24T21:09:36-7e5331a/openbsd-amd64-70-n1 2021-11-24T20:53:54-a3b8f62/openbsd-amd64-70-n1 2021-11-22T18:55:55-2d7ae3f/openbsd-amd64-70 2021-11-22T18:50:19-9e94cc3/netbsd-amd64-9_0-n1 2021-11-22T18:50:19-9e94cc3/openbsd-amd64-70 2021-11-22T16:53:57-cd0bf38/netbsd-386-9_0-n1 2021-11-22T16:53:57-cd0bf38/netbsd-amd64-9_0 2021-11-22T16:15:36-6275b54/netbsd-386-9_0 2021-11-22T16:15:36-6275b54/netbsd-amd64-9_0-n1 2021-11-22T15:54:11-e73c6c8/netbsd-amd64-9_0 2021-11-22T12:30:26-b2aa138/netbsd-amd64-9_0-n1 2021-11-22T12:30:26-b2aa138/openbsd-amd64-70 2021-11-22T12:29:44-ffb6c79/netbsd-386-9_0 2021-11-22T12:29:44-ffb6c79/netbsd-amd64-9_0-n1 2021-11-22T12:29:44-ffb6c79/openbsd-amd64-70 2021-11-22T04:27:29-e30ebaa/netbsd-amd64-9_0 2021-11-22T03:24:07-a287c4a/openbsd-amd64-68 2021-11-22T03:24:01-0c3b4a3/netbsd-amd64-9_0 2021-11-22T03:24:01-0c3b4a3/openbsd-amd64-68 2021-11-22T03:24:01-0c3b4a3/openbsd-amd64-70 2021-11-20T08:47:36-91abe4b/netbsd-386-9_0-n1 2021-11-20T08:47:36-91abe4b/netbsd-amd64-9_0 2021-11-20T08:47:36-91abe4b/openbsd-amd64-70 2021-11-20T00:37:28-d2f4c93/netbsd-386-9_0 2021-11-20T00:32:49-57aba32/netbsd-amd64-9_0 2021-11-19T21:59:14-5e774b0/netbsd-386-9_0-n1 2021-11-19T21:59:14-5e774b0/openbsd-amd64-70

gopherbot commented 2 years ago

Change https://golang.org/cl/383175 mentions this issue: rate: allow for more timing slop in TestWaitSimple

bsiegert commented 2 years ago

A theory on what might be causing this extra "slop":

On NetBSD (not fully sure about OpenBSD), the granularity of sleeps and such is in terms of a kernel constant named HZ, whose default value is 100. Thus, jitter to the next 10-millisecond boundary is unfortunately expected.

bcmills commented 2 years ago

@bsiegert, empirically the amount of slop is not bounded to a constant — or, if it is, that constant is greater than a second. Many of those test failures ended in an attempt to sleep on the order of 3–5 seconds, and missed the target by well over a second.

(I agree that rounding up to a system-dependent subsecond granularity would be reasonable and expected, but that doesn't seem to be the case here.)

gopherbot commented 2 years ago

Change https://go.dev/cl/415234 mentions this issue: os: simplify deadline fluctuation tests