gopherbot opened 2 years ago
Found new dashboard test flakes for:
#!watchflakes
post <- pkg == "runtime/trace" && test == "TestTraceCPUProfile"
As a first pass:

The failure on linux-arm-aws is a timeout, and looks like a conflict between these two goroutines:
goroutine 32 [waiting]:
runtime.systemstack_switch()
/workdir/go/src/runtime/asm_arm.s:274 +0x4 fp=0x183acdc sp=0x183acd8 pc=0x84088
runtime.stopTheWorld({0x1c6ecd, 0xc})
/workdir/go/src/runtime/proc.go:1090 +0x98 fp=0x183acfc sp=0x183acdc pc=0x53e88
runtime.stopTheWorldGC({0x1c6ecd, 0xc})
/workdir/go/src/runtime/proc.go:1137 +0x48 fp=0x183ad10 sp=0x183acfc pc=0x54054
runtime.StopTrace()
/workdir/go/src/runtime/trace.go:320 +0x24 fp=0x183ad28 sp=0x183ad10 pc=0x74ab8
runtime/trace.Stop()
/workdir/go/src/runtime/trace/trace.go:148 +0x78 fp=0x183ad44 sp=0x183ad28 pc=0xe697c
runtime/trace_test.TestTraceCPUProfile(0x1c04e10)
/workdir/go/src/runtime/trace/trace_test.go:621 +0x1a4 fp=0x183af98 sp=0x183ad44 pc=0x1885ec
testing.tRunner(0x1c04e10, 0x1d3994)
/workdir/go/src/testing/testing.go:1446 +0x118 fp=0x183afe0 sp=0x183af98 pc=0xf3654
testing.(*T).Run.func1()
/workdir/go/src/testing/testing.go:1493 +0x30 fp=0x183afec sp=0x183afe0 pc=0xf44ac
runtime.goexit()
/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x183afec sp=0x183afec pc=0x85d14
created by testing.(*T).Run
/workdir/go/src/testing/testing.go:1493 +0x3a0
goroutine 82 [running]:
goroutine running on other thread; stack unavailable
created by runtime/trace.Start
/workdir/go/src/runtime/trace/trace.go:128 +0xd4
The goroutine created on src/runtime/trace/trace.go:128 is the one that calls runtime.ReadTrace: https://github.com/golang/go/blob/38edd9bd8da9d7fc7beeba5fd4fd9d605457b04e/src/runtime/trace/trace.go#L128-L136
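For reference, that goroutine has roughly the following shape (a paraphrase of the linked lines, not an exact copy; the wrapper function and package name are only for the sketch):

```go
package tracesketch

import (
	"io"
	"runtime"
)

// startReader sketches the goroutine created at runtime/trace/trace.go:128:
// it drains the runtime's trace buffers into the io.Writer passed to
// trace.Start until StopTrace makes runtime.ReadTrace return nil.
func startReader(w io.Writer) {
	go func() {
		for {
			data := runtime.ReadTrace()
			if data == nil {
				break
			}
			w.Write(data)
		}
	}()
}
```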
The failure on windows-arm64-10 is a crash (fatal error: unexpected signal during runtime execution, [signal 0xc0000005 code=0x0 addr=0x0 pc=0x7ff6f49eb8e4]), coming from runtime.gentraceback as called by runtime.sigprof: https://github.com/golang/go/blob/55ca6a20afb718972f0625ce971081f78efe10bf/src/runtime/traceback.go#L162
Found new dashboard test flakes for:
#!watchflakes
post <- pkg == "runtime/trace" && test == "TestTraceCPUProfile"
Sorry, but there were parse errors in the watch flakes script. The script I found was:
#!watchflakes
post <- pkg == ("runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`)
And the problems were:
script:2.16: == requires quoted string
See https://go.dev/wiki/Watchflakes for details.
Sorry, but there were parse errors in the watch flakes script. The script I found was:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`)
And the problems were:
script:2.107: unexpected )
See https://go.dev/wiki/Watchflakes for details.
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
This latest set of failures seems related to the v2 tracer; those on arm64 are in the internal/trace/v2 package (right after the ok internal/trace line).
The linux-386-longtest failure looks like a logical race in the v2 tracer's wakeableSleep mechanism, near https://github.com/golang/go/blob/e14b96cb511ab3437dedf2be2fa49d4516189bbc/src/runtime/trace2.go#L908
CC @mknyszek for those.
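For readers unfamiliar with that mechanism: a wakeableSleep is, conceptually, a timed sleep that can also be cut short by an explicit wake. The user-space analog below is illustrative only (the runtime's version in trace2.go is built on runtime timers and notes, not channels), but it shows the kind of sleep/wake interleaving where a logical race can hide.

```go
package main

import "time"

// wakeableSleep is a user-space analog of the runtime mechanism discussed
// above, for illustration only.
type wakeableSleep struct {
	wakeup chan struct{}
}

func newWakeableSleep() *wakeableSleep {
	return &wakeableSleep{wakeup: make(chan struct{}, 1)}
}

// sleep blocks for d, or until wake is called, whichever comes first.
func (s *wakeableSleep) sleep(d time.Duration) {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
	case <-s.wakeup:
	}
}

// wake ends an in-progress sleep, or makes the next sleep return
// immediately if no one is sleeping right now.
func (s *wakeableSleep) wake() {
	select {
	case s.wakeup <- struct{}{}:
	default:
	}
}

func main() {
	s := newWakeableSleep()
	go func() {
		time.Sleep(10 * time.Millisecond)
		s.wake()
	}()
	s.sleep(time.Second) // returns after ~10ms thanks to the wake
}
```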
The openbsd-386-72 failure is for the v1 runtime/trace.TestTraceCPUProfile test, running out of memory while parsing the trace. https://github.com/golang/go/blob/d573a8a8e86157d1d9ea43e6bcb03ce5b9baa9fb/src/internal/trace/parser.go#L269 is events = append(events, ev), where ev is a type rawEvent struct {int; byte; []uint64; []string}, which on GOARCH=386 would be 8 words of 4 bytes each, 32 bytes in total. The failed allocation is runtime.growslice(0xac500000, 0x2c301, 0x2c300, 0x1, 0x81e4340), growing that slice beyond 180992 entries. That's 5.8 MB (growing to 0x375 pages of 8 KiB each, 7.3 MB), which doesn't seem large by itself, but maybe 180k+ entries in the test trace is unusually large? I don't have a baseline for that. There are nice log lines in mheap.grow when the call to mheap.sysAlloc fails, but the crash here is in sysMap (which throws directly in sysMapOS), so the log doesn't include context on the total heap size.
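As a quick check of that arithmetic (the mirror struct below is hypothetical; only the field types and order matter for sizing):

```go
package main

import (
	"fmt"
	"unsafe"
)

// rawEventMirror reproduces the field layout described above; the field
// names are made up for this example.
type rawEventMirror struct {
	off   int
	typ   byte
	args  []uint64
	sargs []string
}

func main() {
	// GOARCH=386: 4 (int) + 1 (byte) + 3 (padding) + 12 + 12 (slice headers) = 32 bytes.
	fmt.Println(unsafe.Sizeof(rawEventMirror{})) // 32 on 386, 64 on amd64

	// 180992 events * 32 bytes = 5791744 bytes ≈ 5.8 MB, and
	// 0x375 (885) pages * 8 KiB = 7249920 bytes ≈ 7.3 MB.
	fmt.Println(180992*32, 885*8*1024)
}
```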
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
Sorry for not replying sooner. The linux-386-longtest failure was fixed by https://go.dev/cl/542335.
The new failure is fun -- that may be an actual failure at runtime from the new tracer. I'll try to reproduce.
As a side-note, the runtime/trace tests barely use the new tracer at all (TestTraceCPUProfile doesn't). If there's a failure there, it's not related to that specifically.
Oh! No, that new failure is not real actually. I put in a low-quality, optimistic "deadlock detector", but it turns out there was no deadlock in this case at all. The test just didn't expect the additional output.
I'll put the "deadlock detector" behind a constant. It's for debugging anyway.
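Gating a debug-only check behind a constant typically looks like the sketch below (hypothetical names, not the actual change from the CL mentioned next); with the constant false, the compiler drops the branch entirely:

```go
package tracesketch

// debugDeadlock gates the optimistic deadlock detector described above.
const debugDeadlock = false

func checkForDeadlock() {
	if !debugDeadlock {
		return
	}
	// Best-effort, possibly noisy deadlock detection for debugging only.
}
```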
Change https://go.dev/cl/544235 mentions this issue: runtime: disable trace v2 optimistic deadlock debugger
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
Note that https://github.com/golang/go/issues/55317#issuecomment-1847818692 seems to have dropped CPU samples.
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
@prattmic, @mknyszek: this still has a fairly high failure rate. Is it still being investigated? (Would it make sense to add a testenv.SkipFlaky for this failure mode for the 1.22 release?)
Apologies, I got deeply side-tracked.
I think SkipFlaky on some platforms is correct here. It occurs to me that this test is inherently flaky. There's nothing guaranteeing that the goroutine that transfers CPU samples from the profile buffer to the trace will run before tracing ends. It can definitely happen that the goroutine just gets delayed for some reason and samples are dropped. More specifically, the goroutine that does the transfer awakens for the first time to see tracing is off, so it ignores the rest of the profile buffer's contents and drops them on the floor. It can't really do anything else because tracing is off; there's nowhere to put the information! This is fine -- samples are always allowed to drop.
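A rough model of that failure mode, with every name invented for the example (this is not the runtime's code):

```go
package main

import (
	"sync/atomic"
	"time"
)

var tracingOn atomic.Bool // stands in for "is tracing currently enabled?"

// transferLoop models the goroutine that moves CPU samples from the profile
// buffer into the trace. If it only wakes up after tracing has stopped, the
// samples still sitting in the buffer have nowhere to go and are dropped.
func transferLoop(flush func()) {
	for {
		time.Sleep(10 * time.Millisecond) // the wakeup may be delayed arbitrarily
		if !tracingOn.Load() {
			return // tracing is off: drop whatever remains in the buffer
		}
		flush() // write buffered samples into the trace
	}
}

func main() {
	tracingOn.Store(true)
	go transferLoop(func() { /* write samples into the trace */ })
	// If tracing stops before the goroutine's next wakeup, the final batch
	// of samples is lost, which is what the flaky test observes.
	tracingOn.Store(false)
	time.Sleep(20 * time.Millisecond)
}
```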
I'll think about how to make this test less flaky, but given that it's not failing with any regularity on first-class platforms, I think doing platform-specific skips is fine. There's nothing really platform-specific about this code.
If there is nothing platform-specific about the code, why would a platform-specific skip be even remotely appropriate‽
One of the interesting properties of the reverse builders for secondary ports is that they run on more diverse hardware — with more diverse scheduling — than our GCE-based builders. It seems gravely mistaken to assume that external users on first-class ports are using hardware and kernels that match our GCE builders, so why should we assume that they will not also run into this flakiness when running go test all or similar on their own CI systems?
Yeah, you're right. I frequently forget that other users might run go test all; I was mainly just thinking about test coverage loss on our side.
I think I can just make the test more resilient to lost samples. It seems like in all these failures at least some of the samples show up in the trace. I figure we can just make the test pass if at least one of these samples shows up in the trace.
This only verifies that functionality works at a baseline, not how well it works, but given that the functionality is best-effort anyway, I think that's about as good as we can make it for now.
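The relaxed check might look roughly like this sketch (not the actual test code; the map-based stack representation is a simplification):

```go
package trace_test

import "testing"

// checkSomeSamplesPresent requires only that at least one profiled stack
// appears in the trace, rather than all of them, since samples may
// legitimately be dropped by this best-effort mechanism.
func checkSomeSamplesPresent(t *testing.T, pprofStacks, traceStacks map[string]int) {
	t.Helper()
	for stack := range pprofStacks {
		if traceStacks[stack] > 0 {
			return // at least one sample made it into the trace
		}
	}
	t.Errorf("no CPU profile samples appeared in the execution trace")
}
```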
Taking a step back, I think this actually represents a regression in behavior vs. the old tracer, but it was really hard to tell because of how best-effort this functionality is.
Specifically, the old tracer was subtly ensuring that there was a CPU profile buffer flush sometime close to when a trace would end, because the trace reader would get woken up a whole bunch. In the new tracer the CPU profile buffer is translated into the execution trace via a separate goroutine. This has in general been positive for understanding the best-effort-ness of this behavior, but what we lost was this property that a flush was likely to happen close in time to when the trace ended.
My proposal to restore this behavior is just to wake the trace CPU profile reader goroutine before we're about to change generations. It does not ensure that a flush happens, but makes it far more likely. It's still best-effort, but a bit more predictable and will provide more understandable results.
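In sketch form, the proposed ordering is simply the following (hypothetical names; not the actual traceAdvance code):

```go
package tracesketch

// advanceSketch wakes the CPU profile reader goroutine just before switching
// trace generations, so a flush is likely, though still not guaranteed, to
// happen while the old generation is still open.
func advanceSketch(wakeCPUProfileReader, changeGeneration func()) {
	wakeCPUProfileReader() // best-effort nudge; does not block for a flush
	changeGeneration()
}
```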
Overall, I'm not really that happy with the fact that this is so best-effort still. Early on in the trace design I tried to write this trace CPU profile reader goroutine to actually block on the CPU profile buffer getting filled, making it much more precise. But because of the problems on Darwin with CPU profiles we encountered earlier this year, I had to work around that.
In the future it would be good to consider blocking in traceAdvance until we can ensure a recent flush of the CPU profile buffer happened. I'm reluctant to introduce that now just because that's how you introduce deadlocks in a new release, but I'll file an issue to track that.
I'll send a CL with the better-effort wakeup, which should hopefully resolve most, if not all, of the failures seen in this issue.
Change https://go.dev/cl/555495 mentions this issue: runtime: make a much better effort to emit CPU profile in a trace
@prattmic convinced me to try to do the right thing here, and eliminate all parts of the tracer making this CPU profile -> trace process best-effort. https://go.dev/cl/555495 implements this. I was originally skeptical given that we're pretty late in the release cycle, but I figure at the very least we can have this on hand and decide whether it's worth submitting now. If it's not, I can land it for next cycle and I'll send the wakeup patch (which is like, 5 lines of code) which will certainly not make the situation any worse.
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
Change https://go.dev/cl/557838 mentions this issue: [release-branch.go1.22] runtime: make a much better effort to emit CPU profile in a trace
Found new dashboard test flakes for:
#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
Issue created automatically to collect these failures.
Example (log):
— watchflakes