Closed bcmills closed 1 year ago
This happened in a TryBot in https://storage.googleapis.com/go-build-log/f1e11825/android-amd64-emu_262486a5.log:
##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
fatal: systemstack called from unexpected goroutineTrap
exitcode=133
FAIL runtime 16.177s
FAIL
2022/05/03 14:18:09 Failed: exit status 1
go tool dist: FAILED
Marking as release-blocker because this affects TryBot runs. Since android/amd64 is not a first-class port, either the underlying bug can be diagnosed and fixed, or the builder can be removed from the default TryBot set. (I'll leave that choice up to @golang/runtime to decide and implement.)
This may or may not be OS-specific. There is another failure in the builder logs since February, but on plan9 rather than android; it isn't obvious to me whether that is an independent bug.
greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-02-03
2022-03-05T21:20:16-e155b03-45f4544/plan9-amd64-0intro
greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-03-06
2022-05-03T19:48:07-bccce90/android-arm64-corellium
@golang/runtime This is a second-class port, but because it's a TryBot, this is a release blocker. Should we consider removing it as a TryBot? Is it bringing us enough value?
Change https://go.dev/cl/407615 mentions this issue: dashboard: remove android-amd64-emu from main go repo's TryBot set
I've mailed CL 407615 that makes android-amd64-emu a post-submit builder only (in the main repo) while investigation of this issue is underway. If submitted, this issue can be unmarked as a release-blocker for Go 1.19.
Curiously, this does not appear to be arch-specific: we've seen these failures on both amd64 and arm64.
greplogs -l -e 'fatal: systemstack called from unexpected goroutine' --since=2022-05-04
2022-05-20T22:30:37-2b0e457/android-arm64-corellium
The first failure shows exitcode=133. This is likely bash parlance for exiting with signal 5 (SIGTRAP). From man bash: The return value of a simple command is its exit status, or 128+n if the command is terminated by signal n.
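To make the 128+n arithmetic concrete, here is a minimal Go sketch that decodes a shell-style exit code back into a signal number. The helper `signalFromExitCode` and the small name table are hypothetical, written only for illustration, and the signal numbers assume Linux numbering (5 for SIGTRAP, 11 for SIGSEGV).

```go
package main

import "fmt"

// linuxSignalNames is a tiny illustrative table (an assumption: Linux
// signal numbering). It is not a complete list.
var linuxSignalNames = map[int]string{
	5:  "SIGTRAP",
	11: "SIGSEGV",
}

// signalFromExitCode interprets a shell-style exit code. Per the bash
// convention quoted above, a command terminated by signal n reports 128+n.
// It returns the signal number and true if the code is in that range.
func signalFromExitCode(code int) (int, bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

func main() {
	for _, code := range []int{133, 139, 1} {
		sig, ok := signalFromExitCode(code)
		if !ok {
			fmt.Printf("exitcode=%d: ordinary exit status\n", code)
			continue
		}
		name := linuxSignalNames[sig]
		if name == "" {
			name = "unknown"
		}
		fmt.Printf("exitcode=%d: signal %d (%s)\n", code, sig, name)
	}
}
```

Under these assumptions, exitcode=133 decodes to signal 5, which is consistent with the SIGTRAP reading above.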
If I recall correctly, Android applies a seccomp syscall filter to (all?) processes. I wonder if we are violating this filter on the throw path, resulting in truncation of the stack trace. seccomp with mode SECCOMP_RET_TRAP sends a SIGTRAP on violation.
@golang/android do you know if the Android seccomp filters apply to processes on our builders, and if so which one?
No repros of this on 25 gomotes all weekend. I did find #53250, plus several no-context SIGSEGVs in the runtime test, like:
##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
Segmentation fault
exitcode=139
FAIL runtime 19.914s
FAIL
2022/06/05 22:34:10 Failed: exit status 1
(Some were in the standard runtime test rather than the -cpu variant.)
This isn't a first-class port, so dropping release-blocker.
This port is still run as a default TryBot until/unless CL 407615 is merged. IMO known failures on TryBots should still block releases, since they still add testing noise for anyone who uses TryBots on a pending change.
In the interest of decoupling this issue from the Android TryBots in general, I've filed #53377 (as a release-blocker) to decide whether to remove the TryBots or fix their known failure modes.
Summarizing the known failures with this pattern on Android:
greplogs -l -e '(?ms)\Aandroid-.*^fatal: systemstack called from unexpected goroutine'
2022-05-20T22:30:37-2b0e457/android-arm64-corellium
2022-05-03T19:48:07-bccce90/android-arm64-corellium
2022-02-02T21:12:39-53d6a72/android-amd64-emu
2021-10-08T16:26:20-59d4e92-99c1b24/android-amd64-emu
So it looks like this bug was probably introduced sometime in 2021..? (Or else, maybe the check itself was introduced then? 😅)
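For anyone reading the greplogs pattern above: a rough Go sketch of how its flags behave with the standard `regexp` package. `(?s)` lets `.` cross newlines, `(?m)` lets `^` match at the start of any line, and `\A` pins `android-` to the very start of the text. This assumes, as the `\A` anchor suggests, that each log begins with the builder name; the sample logs below are made up for illustration and are not real builder output.

```go
package main

import (
	"fmt"
	"regexp"
)

// androidSystemstack is the pattern from the greplogs invocation above.
var androidSystemstack = regexp.MustCompile(
	`(?ms)\Aandroid-.*^fatal: systemstack called from unexpected goroutine`)

// matchesAndroidFailure reports whether a log (assumed to start with the
// builder name) is an android log containing the fatal message at the
// start of a line.
func matchesAndroidFailure(log string) bool {
	return androidSystemstack.MatchString(log)
}

func main() {
	// Hypothetical log contents, for illustration only.
	samples := []struct{ name, log string }{
		{"android, message at line start",
			"android-amd64-emu\n##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick\n" +
				"fatal: systemstack called from unexpected goroutineTrap\nexitcode=133\n"},
		{"plan9, message at line start",
			"plan9-amd64-0intro\nfatal: systemstack called from unexpected goroutine\n"},
		{"android, message mid-line",
			"android-arm64-corellium\nlog: fatal: systemstack called from unexpected goroutine\n"},
	}
	for _, s := range samples {
		fmt.Printf("%-32s match=%v\n", s.name, matchesAndroidFailure(s.log))
	}
}
```

Only the first sample matches: the second fails the `\Aandroid-` prefix, and the third fails because `^` requires the message to begin a line.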
Change https://go.dev/cl/412174 mentions this issue: dashboard: add known issues for android-*-emu
Rolling forward to 1.20.
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `android` && `systemstack called from unexpected goroutine`
— watchflakes
There seem to have been no new failures for some time.
Note that the rate of testing is much lower now because of the freeze. (6 months is a good window size for checking failure rates.)
Still none after the tree reopened. Maybe fixed?
Change https://go.dev/cl/465156 mentions this issue: dashboard: unmark known-issues with low failure rates
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
greplogs --dashboard -md -l -e '^fatal: systemstack called from unexpected goroutine' --since=2021-01-01
2022-02-02T21:12:39-53d6a72/android-amd64-emu
2021-10-08T16:26:20-59d4e92-99c1b24/android-amd64-emu
I'll also note that badsystemstackMsg seems to be missing a final newline as of CL 93659 (CC @aclements @randall77). 😅