Closed (gopherbot closed this issue 2 months ago)
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "runtime" && test == "TestRuntimeLockMetricsAndProfile"
Change https://go.dev/cl/544375 mentions this issue: runtime: test for contention in both semaphore paths
Most of the failures are from before https://go.dev/cl/544195. Following that, I've seen four failures (via `fetchlogs`/`greplogs`):

- https://build.golang.org/log/c733e91b4693774eb42501bbe93a8ee071ef312e: addressed by https://go.dev/cl/544375
- https://build.golang.org/log/79b92d5304bd79fbdfac90c953d53b172f2d8250: addressed by https://go.dev/cl/544375
- https://build.golang.org/log/2e5e4f5629a179365a02d2bb9548ed5b94597c54: during a slow runtime test, which I expect adds a lot of non-contention work, making it harder to encounter the desired contention ... logging to confirm in https://go.dev/cl/544375
- https://build.golang.org/log/6ee5577c3e0a08531b8d7f009631323b4aab7c23: on linux-ppc64le, showing pprof with the amount of contention I expect but a metrics view with much more weight. One of the earlier linux-s390x failures shows similar behavior (though it also failed due to the big-endian bug I fixed in CL 544195).
That's two more ppc64 failures from a clock mismatch (not sure what's up with ppc64's clocks), and one more of the type that https://go.dev/cl/544375 addresses.
Note that there are some additional timeouts at https://github.com/golang/go/issues/55308#issuecomment-1819326331.
I'm not sure how to update the watchflakes rule to match these timeouts.
From that issue comment, `2023-11-19 22:05 darwin-amd64-longtest go@63828938 runtime` and `2023-11-20 02:02 darwin-amd64-longtest go@237715cf runtime` are timeouts during the `maymorestack=mayMoreStackMove` re-run of the runtime tests. Those predate the roll-forward of https://go.dev/cl/544195, which includes a related update to the semrelease test to address those exact failures. (I'd found them via `fetchlogs`/`greplogs`.)
Previously, the test involved three threads and required one of them to hold the semaphore's lock, one to contend for it, and the third to notice that contention and mark the test as complete. My interpretation of the `mayMoreStackMove` failures is that they added a lot of non-contention work, which made it hard for the test's three threads to each be in exactly the right spot at a single instant.

In https://go.dev/cl/544195, I relaxed that requirement: now the test can finish early if one of the threads notices that the other two have created the necessary contention; otherwise it gives up after 10,000,000 iterations. (My darwin/arm64 system typically needed 500,000–4,000,000 iterations during the `mayMoreStackPreempt` re-test.) In either case, the only verification it does of the result is that there's "more than zero" contention and that the call stack appears as expected.
I continue to struggle with the balance between wanting thorough tests for this part of the runtime (involving concurrency and clocks) and the need for 0% flakes. Maybe the performance dashboard sets an example there, including an opportunity for more output than `ok  runtime  690.399s`? I'd appreciate any advice you can share. Thanks!
This failure appeared on a first-class port on https://go.dev/cl/546635. @bcmills, was that CL based on a commit from before https://go.dev/cl/544375 landed, by any chance?

EDIT: Yes, it was. Here's the failure: https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8762881407384401921/overview
From https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8762881407384401921/overview, I get https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8762881407384401921/+/u/step/11/log/2, which shows two `metrics_test.go:1103: want stack` lines, indicating that it includes https://go.dev/cl/544375.
CL 546635 at PS 4 is https://go.googlesource.com/go/+/30e6fc629529abf8da4528f4fdbb5a78363624fb, with parent https://go.googlesource.com/go/+/2e6387cbec924dbd01007421d7442125037c66b2 .
I don't see a line in the failing build log matching https://go.googlesource.com/go/+/2e6387cbec924dbd01007421d7442125037c66b2/src/runtime/metrics_test.go#1271 , which means the test ran for 10,000,000 iterations without the test code noticing contention on the semaphore lock. The mutex profile result shows that the runtime itself didn't encounter contention either. (In #55160, the contention was present but the test had been unable to notice it and so would run forever.)
The `TestRuntimeLockMetricsAndProfile/runtime.lock` test verifies that runtime-internal lock contention can be reported with the correct count, magnitude, and call stack. The role of the `TestRuntimeLockMetricsAndProfile/runtime.semrelease` test is to check that the call stack ends at a particular depth, mostly so we can notice when that changes (so we can update the skip parameter, for example). It's proven trickier to test, since the lock itself isn't under the test's control. Do you have advice on how to test semacquire/semrelease, or is the best option to `t.Skip` that for now?
Updated the `watchflakes` pattern to use `post` instead of `default`, since there were a lot of these posted on #58901.
Found new dashboard test flakes for:
#!watchflakes
post <- pkg == "runtime" && test == "TestRuntimeLockMetricsAndProfile"
@rhysh, note that the failure in https://github.com/golang/go/issues/64253#issuecomment-1847818829 is on `linux/386`, which is a first-class port.
https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8761883502571661809/test-results?sortby=&groupby= on a `linux/arm64` LUCI TryBot (also a first-class port).
Alrighty, I think we need to do something about this test.
@rhysh, from your last comment I take it your conclusion is that this is a test-only issue, right? For the moment, let's skip the flakiest of the tests, which seems to be the semrelease one.
> Do you have advice on how to test semacquire/semrelease, or is the best option to `t.Skip` that for now?
Looking at the test, it seems like it just has 3 workers acquiring/releasing the same semaphore. Could you force them to block on the semaphore and operate in lockstep? Is the problem that it never takes the slow path, or that nothing ever blocks on the lock in the slow path? If it's the latter, then I guess operating in lockstep doesn't really help.
Would it be possible to manufacture something that also expects similar call stack pruning but is basically just a `runtime.mutex` under the hood, so we can be more certain about what's going to happen?
Hm, it seems like the `runtime.lock` test is also failing a fair amount, more often on the secondary port machines, but still. Is that a concern?
For `semrelease`:

> or that nothing ever blocks on the lock in the slow path?

Right, the critical section for manipulating semaphore sleeps is pretty small and not under the control of the (runtime-internal) user of the semaphore. There's also the problem that the runtime has 251 locks, with the hash of the semaphore address determining which one to use. So observing contention on one of those shared locks (as the current test does) isn't proof that the test's own semaphore use is the one causing or experiencing the contention.
The failure in 2023-11-29T19:13:04-636c6e3/freebsd-arm64-dmgk saw contention (its output includes `finished test early (9999534 tries remaining)`), but the only new contention is on `runtime.unlock runtime.parkunlock_c runtime.park_m runtime.mcall` (benign, 12384 ns) and `runtime.unlock runtime.semrelease1 sync.runtime_Semrelease sync.(*WaitGroup).Add sync.(*WaitGroup).Done runtime_test.TestRuntimeLockMetricsAndProfile.func4.1.1.1 runtime_test.(*contentionWorker).run` (a semrelease, but of a different semaphore, `started.Done()`; 408 ns).
Maybe the semrelease test could look for a particular kind of failure rather than requiring success: if the profile includes the target function (`runtime_test.TestRuntimeLockMetricsAndProfile.func6.1`), then make sure it has one of the two allowable call stacks, but don't require that the target function appears in the profile at all.
Or, outright skip the test for Go 1.22. It's unlikely we'll land a change that would accidentally break it further, and the level of detail this is trying to test isn't visible without an opt-in GODEBUG flag anyway.
In triage, we decided to just skip these tests for now. That'll let us mark this okay-after-rc1, but with a goal of further investigation.
More of an explanation: these tests are checking behavior that's behind a `GODEBUG` anyway. It's fine if we just disable the tests, because even if something breaks between now and the next dev period (unlikely), it won't affect the vast majority of users.
We won't delete the tests because that functionality is going to be revisited for Go 1.23 anyway (by @rhysh I believe).
Change https://go.dev/cl/550775 mentions this issue: runtime: skip TestRuntimeLockMetricsAndProfile for flakiness
I think this can still be worked on in Go 1.22 as a test-only fix, but there's really no reason this has to remain a release blocker, even okay-after-rc1. Moving to Go 1.23.
Found new dashboard test flakes for:
#!watchflakes
post <- pkg == "runtime" && test ~ `TestRuntimeLockMetricsAndProfile`
Change https://go.dev/cl/581296 mentions this issue: runtime: test mutex contention stacks and counts
Change https://go.dev/cl/586237 mentions this issue: runtime: improve runtime-internal mutex profile tests
Change https://go.dev/cl/587515 mentions this issue: runtime: lower mutex contention test expectations
Issue created automatically to collect these failures.