Open bcmills opened 11 months ago
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Since this seems to be very specific to the builder, in triage we're wondering if anyone from @golang/s390x could start taking a look. It seems to be failing often enough that bisection should identify the culprit, and then that might just reveal the issue.
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
@mknyszek @bcmills thank's for bringing this issue to our attention. I am looking into the issue.
@mknyszek @bcmills We had a look at this issue and we used git bisect to figure out the culprit commit.
Bad Commit Id: de5b418b which introducing failure of TestCrashDumpsAllThreads() We had a look at the file changes corresponding to the bad commit, we see that there are some changes to signal_unix.go file and we suspect that issue lies with the changes made to this file. We are further having look at the changes and we will update when we find more about the root cause of the failure.
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
cc @zzkcode (https://github.com/golang/go/issues/64650#issuecomment-1857902016)
@srinivas-pokala Note that https://go.dev/cl/554615 is adjusting this behavior slightly, though I don't know if it will affect these failures.
The failures do look concerning. e.g., in https://build.golang.org/log/526c7281403cbfa93b3aa016ec3281c2e487d7ec m=5 is running a goroutine, but it doesn't get a SIGQUIT dump.
If this is easily reproducible, something to try is to see whether either the crash thread or non-crash threads are hitting the timeouts defined in https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=777-785;drc=1ae729e6d34040a84da8ef2fc0b9781efe9b0d95 (i.e., sleeping for 5s total). We don't want that in the normal case. The normal case should be crashing.Loading()
incrementing enough to exit the loop.
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
If this is easily reproducible, something to try is to see whether either the crash thread or non-crash threads are hitting the timeouts defined in https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=777-785;drc=1ae729e6d34040a84da8ef2fc0b9781efe9b0d95 (i.e., sleeping for 5s total). We don't want that in the normal case. The normal case should be
crashing.Loading()
incrementing enough to exit the loop.
@prattmic Thanks. Just take a look on several test failures here, seems none of them exceed 5s(if test time is working correctly). Before this change, usually the last thread which received SIGQUIT will attempt to crash the process, and the others will sleep 5s. After that, the first crash thread will periodically check and crash the process if crashing.Load() > mcount()-int32(extraMLength.Load())
. My guess is that maybe it needs more time to receive and handle the SIGQUIT on slow machine? In the former case, if the last one is ready to crash, the others may had already in sleep state. In the latter case, however, they may just increase the crashing
and even have no time to print out the SIGQUIT and PC... while the first thread is started to crash?
--- FAIL: TestCrashDumpsAllThreads (0.52s)
--- FAIL: TestCrashDumpsAllThreads (3.04s)
--- FAIL: TestCrashDumpsAllThreads (0.11s)
--- FAIL: TestCrashDumpsAllThreads (0.36s)
--- FAIL: TestCrashDumpsAllThreads (0.48s)
--- FAIL: TestCrashDumpsAllThreads (1.24s)
My guess is that maybe it needs more time to receive and handle the SIGQUIT on slow machine? In the former case, if the last one is ready to crash, the others may had already in sleep state. In the latter case, however, they may just increase the
crashing
and even have no time to print out the SIGQUIT and PC... while the first thread is started to crash?
sighandler
should not be re-raising SIGQUIT
until after it has already dumped the backtrace for the current thread.
The pattern in the recent watchflakes reports for this issue is that one thread is reported in crashDumpsAllThreadsLoop
, and three others (presumably the other crashDumpsAllThreadsLoop
threads) are reported as goroutine running on other thread; stack unavailable
.
That is, sighandler
successfully received the signal and ran, but the crash thread's call to tracebackothers
failed to print stacks for those other goroutines. That happens here:
https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=1255-1258;drc=a9c9cc07ac0d3dc73865a57e6ce45c22ada3b5c9
@prattmic, that suggests an answer to your question:
If this is easily reproducible, something to try is to see whether either the crash thread or non-crash threads are hitting the timeouts
The answer, I think, is that it doesn't matter: the crash thread doesn't start propagating the SIGQUIT
signal to the other threads until after it has already called tracebackothers
and dumped the redacted stacks.
@bcmills Am I understanding correctly that you are saying that ultimately the test is failing because traceback reports goroutine running on other thread; stack unavailable
for the running threads? If so, I think that is a misunderstanding. That report is expected and appears even when the test passes (attached below). Part of the purpose of sending SIGQUIT to every thread so we can report the stacks even of threads that are spinning in user code; this test tries to ensure that we reach every thread.
Passing test output:
Hmm. The goroutine running on other thread; stack unavailable
message was printed for three goroutines, but that would imply four SIGQUIT
traces in the output (one for sysmon
and one for each thread running such a goroutine).
However, in https://build.golang.org/log/5e29be06bbb7a8a4923d83a58972aa2a10677a94 there are only three thread tracebacks from SIGQUIT
signals:
sysmon
thread, which certainly could not have been running crashDumpsAllThreadsLoop
.crashDumpsAllThreadsLoop
(as expected).findRunnable
.Even if we assume that the findRunnable
thread had previously been running one of the crashDumpsAllThreadsLoop
goroutines, that still leaves at least one thread unaccounted-for.
I thought maybe the SIGQUIT
signals aren't delivered to the running threads at all because of #18717. (This is a -race
builder, after all.)
But that doesn't quite fit either: if that were the case, I would not expect any crashDumpsAllThreadsLoop
threads with SIGQUIT
dumps at all, but there is one present. 🤔
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
For the record, I hit this on go1.22rc2 on a Fedora build.
https://kojipkgs.fedoraproject.org//work/tasks/3287/112403287/build.log
The hardware can be seen here
It built after another run, in the same machine.
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Found new dashboard test flakes for:
#!watchflakes
post <- goarch == "s390x" && pkg == "runtime" && test == "TestCrashDumpsAllThreads"
Go version
ca804118b1ff4ea6cd0b83808e140671d7e10f6f
What operating system and processor architecture are you using (
go env
)?What did you do?
https://build.golang.org/log/5e29be06bbb7a8a4923d83a58972aa2a10677a94
What did you expect to see?
All tests passing.
What did you see instead?
(attn @golang/s390x)