I think that since https://go-review.googlesource.com/c/go/+/339990, CL https://go-review.googlesource.com/c/go/+/339989 causes the windows-arm64-10 builder to fail.
Filed the windows-arm64-10 issue separately as #49188.
Change https://golang.org/cl/359254 mentions this issue: runtime: disable TestSegv on darwin, illumos, solaris
Hmm, looks like relaxing the check didn't completely solve the problem:
greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-28T16:54:00
linux-arm-aws
is the only one that appears to still be flaky:
greplogs --dashboard -md -l -e '(?ms)TestSegv.*unexpectedly saw "runtime: "'
2021-11-02T20:47:30-b246873/linux-arm-aws
2021-11-02T17:01:14-62b29b0/linux-arm-aws
2021-10-28T16:54:58-6bd0e7f/linux-arm-aws
This looks like the signal is landing while we're in the VDSO, presumably executing a kernel-provided atomic for casgstatus.
@prattmic, you looked at and fixed a few similar issues recently (e.g., 86f6bf18b013d570e89f57c2decaddca5ce2a847). Any insights on this one?
Yeah, the "unknown PC" is in the kernel-provided CAS implementation. The problem may just be that we're entirely missing the vdsoSP protection around the VDSO cas call (presumably the same thing applies to memory_barrier<>
, too).
Yes, I believe that is the case. I'll get those covered as well.
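For readers following along, here is a minimal standalone sketch of the vdsoPC/vdsoSP bookkeeping being discussed (illustrative only; the real code lives in runtime assembly). The idea is that the caller's PC and SP are recorded before entering code the Go unwinder cannot walk, and cleared afterwards, so a signal landing inside the kernel helper can still be attributed to the recorded caller:

```go
// Illustrative sketch only: the real bookkeeping is done in assembly inside
// the Go runtime. The field names mirror runtime's m.vdsoPC/m.vdsoSP, but
// this is a standalone mock.
package main

import "fmt"

type m struct {
	vdsoPC uintptr // caller PC to use for a traceback if a signal lands in the helper
	vdsoSP uintptr // caller SP; nonzero means "currently inside a VDSO/kernel helper"
}

// callKernelHelper records where a traceback should restart before entering
// code that has no Go frame information, and clears the marker afterwards.
func callKernelHelper(mp *m, callerPC, callerSP uintptr, helper func()) {
	mp.vdsoPC = callerPC
	mp.vdsoSP = callerSP
	helper()      // e.g. the kernel-provided cas or memory_barrier helper
	mp.vdsoSP = 0 // the signal handler treats vdsoSP==0 as "not in a helper"
}

func main() {
	mp := &m{}
	callKernelHelper(mp, 0x1000, 0x2000, func() {})
	fmt.Printf("after the call: vdsoPC=%#x vdsoSP=%#x\n", mp.vdsoPC, mp.vdsoSP)
}
```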
Change https://golang.org/cl/362796 mentions this issue: runtime/internal/atomic: treat ARM kernel helpers as VDSO
Change https://golang.org/cl/362795 mentions this issue: runtime: refactor ARM VDSO call setup to helper
Change https://golang.org/cl/362977 mentions this issue: runtime: start ARM atomic kernel helper traceback in caller
Mostly fixed. Not a Beta1 blocker as it's not a new breakage, just a stricter test.
Looks like this is fixed on ARM and skipped on amd64, but still occasionally failing on some of the more exotic architectures. (It's not clear to me whether the remaining failures are arch-specific bugs.)
greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-13
2021-11-17T04:55:12-1d004fa/linux-mips64le-mengzhuo
2021-11-13T00:23:16-39bc666/linux-riscv64-unmatched
Looks like both of those failures are in the SegvInCgo subtest.
https://storage.googleapis.com/go-build-log/3c4f5e79/linux-riscv64-jsing_db2bc678.log (a linux-riscv64-jsing SlowBot) also failed in SegvInCgo.
greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-18
2021-12-02T20:45:47-3c6295d/netbsd-amd64-9_0-n2d
2021-12-02T16:48:07-36be0be/netbsd-amd64-9_0-n2d
2021-11-24T21:11:45-b77f5f9/linux-mips64le-mengzhuo
2021-11-24T20:59:14-67dd9ee/linux-s390x-ibm
2021-11-24T20:53:29-003e7fa/linux-riscv64-jsing
2021-11-24T05:36:54-47db3bb/openbsd-amd64-70
2021-11-23T19:31:05-0f64c21/linux-riscv64-unmatched
2021-11-22T03:24:07-a287c4a/openbsd-amd64-70
2021-11-18T02:17:23-ce7e501/linux-arm64-packet
The netbsd and openbsd failures there appear to be #49209.
Excluding the NetBSD and OpenBSD failures, all the ones that are left are in TestSegvInCgo, like the two @bcmills posted about before. I don't think this is the same problem as the one @prattmic fixed earlier, because even an arm64 builder is failing there.
I'll poke at this one a bit.
In all these cases it looks like the signal is landing in some part of the cgocall path (not all that surprising), though I'm not sure I understand why gentraceback has issues producing a traceback in these cases.
Oh, OK. Actually, for the linux/arm64 failure, the PC looks like the PC for the branch that @prattmic added in https://go.dev/cl/362977 (i.e. the top bits of the PC, in a 48-bit address space, are 0xffff). So there's some kind of other VDSO (?) call on at least this platform, where there's more going on.
The other platforms' failures don't seem to look like this, however.
The arm64 failure could occur even with 363459479014bde19e83d9fb6781310f63fb0b45 if one of the kernel helpers made a CALL (thus setting LR). I thought I verified that none of the helpers we used did this, but I guess I should go back and double-check on the specific kernel version on the builders.
Beyond this, it looks like we have failures on linux-mips64le, linux-riscv64, and linux-s390x.
I'll note that 3634594 only applies to 32-bit ARM, AFAICT, so it shouldn't be taking that path. It could still be that we have some VDSO call that makes frames.
The RISC-V failures are kind of interesting because there's no hexdump of the stack -- I'm not really sure why. Notably, on cgoSigtramp, the LR is 0 (cgoSigtramp is a thin wrapper around sigtramp that just jumps, with a zero LR).
In the linux-386 failure, PC=0xf7f3db09 seems likely to be in the VDSO. The VDSO load address is randomized, but this seems to be in approximately the correct range. Additionally, page offset 0xb09 is a valid instruction just after a syscall, which seems like a likely place to get a signal delivered.
0xf7fd1b00 <__kernel_vsyscall>: push %ecx
0xf7fd1b01 <__kernel_vsyscall+1>: push %edx
0xf7fd1b02 <__kernel_vsyscall+2>: push %ebp
0xf7fd1b03 <__kernel_vsyscall+3>: mov %esp,%ebp
0xf7fd1b05 <__kernel_vsyscall+5>: sysenter
0xf7fd1b07 <__kernel_vsyscall+7>: int $0x80
0xf7fd1b09 <__kernel_vsyscall+9>: pop %ebp
0xf7fd1b0a <__kernel_vsyscall+10>: pop %edx
0xf7fd1b0b <__kernel_vsyscall+11>: pop %ecx
0xf7fd1b0c <__kernel_vsyscall+12>: ret
Go doesn't use __kernel_vsyscall directly, but libc does. Maybe we get there somehow in a context where the signal handler doesn't realize this is cgo?
Edit: Note that the assembly dump above is from an unrelated runtime.test process that did not fail.
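As a rough way to sanity-check the "approximately the correct range" claim, something like the following (a sketch written for this discussion, Linux-only, and inspecting the current process rather than the one that failed) reports whether a PC falls inside the randomized [vdso] mapping and what its page offset is:

```go
// Rough sketch: report whether a PC falls inside this process's [vdso]
// mapping by parsing /proc/self/maps. Linux-only, and it inspects the
// current process, not the process that failed, so it only shows that
// the address is in a plausible range.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	const pc = uint64(0xf7f3db09) // PC from the linux-386 failure log
	fmt.Printf("page offset of pc: %#x\n", pc&0xfff)

	f, err := os.Open("/proc/self/maps")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasSuffix(line, "[vdso]") {
			continue
		}
		// Each line looks like: "start-end perms offset dev inode path".
		rng := strings.SplitN(strings.Fields(line)[0], "-", 2)
		start, _ := strconv.ParseUint(rng[0], 16, 64)
		end, _ := strconv.ParseUint(rng[1], 16, 64)
		fmt.Printf("[vdso] mapping %#x-%#x, pc in range: %v\n",
			start, end, start <= pc && pc < end)
	}
}
```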
e.g., __kernel_vsyscall is reachable from cgo_start_thread, which is called via asmcgocall.
It looks like the signal handler won't properly detect raw asmcgocall calls as cgo calls, because sighandler checks g.m.incgo, which is only set in cgocall.
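To make the gap concrete, here is a stripped-down sketch of that logic (simplified stand-in types, not the runtime's actual declarations): cgocall flips m.incgo around the C call, so a signal arriving there is treated as a cgo signal, while a bare asmcgocall never sets the flag:

```go
// Simplified stand-in types for illustration; not the runtime's real
// declarations or code paths.
package main

import "fmt"

type m struct{ incgo bool }
type g struct{ m *m }

// cgocall-style path: marks the M as "in C code" for the duration of the call.
func cgocall(gp *g, fn func()) {
	gp.m.incgo = true
	fn()
	gp.m.incgo = false
}

// asmcgocall-style path: calls fn directly and never touches incgo.
func asmcgocall(gp *g, fn func()) {
	fn()
}

// sighandlerSeesCgo mimics the check in question: the incgo flag is the
// only hint the signal handler has that the fault happened in a cgo call.
func sighandlerSeesCgo(gp *g) bool { return gp.m.incgo }

func main() {
	gp := &g{m: &m{}}
	cgocall(gp, func() { fmt.Println("via cgocall, detected as cgo:", sighandlerSeesCgo(gp)) })
	asmcgocall(gp, func() { fmt.Println("via asmcgocall, detected as cgo:", sighandlerSeesCgo(gp)) })
}
```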
@cherrymui does this sound correct to you, or did I miss some detection mechanism?
@prattmic I think that is right. Direct asmcgocall is used in limited situations, but I guess it is still possible to receive an external signal.
VDSO calls made from Go would have m.vdsoPC and m.vdsoSP set, so we can tell. But if it is made from non-Go code then we can't tell.
I still think it is too strong to require that we can unwind the stack any time we receive an external signal. I think it is important to print SIGSEGV, but the unwinding could be best effort. Maybe we could try unwinding the curg stack if the signal lands on g0 (even if incgo is not set), so it could be somewhat helpful.
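A rough sketch of that fallback idea (every name here is a stand-in invented for this comment, not runtime code): if neither vdsoSP nor incgo tells us where we are and the signal landed on g0, fall back to a best-effort traceback of curg rather than failing outright.

```go
// Illustrative sketch of the proposed best-effort policy; all types and
// functions here are stand-ins, not the runtime's actual code.
package main

import "fmt"

type signalState struct {
	incgo  bool    // set by the cgocall path
	vdsoSP uintptr // nonzero while inside a VDSO/kernel-helper call
	onG0   bool    // the signal landed while running on the system stack
}

// chooseTraceback picks which stack to try to unwind after an external
// SIGSEGV, treating the traceback as best effort rather than mandatory.
func chooseTraceback(st signalState) string {
	switch {
	case st.vdsoSP != 0:
		return "restart the unwind at the saved vdsoPC/vdsoSP"
	case st.incgo:
		return "report a cgo traceback"
	case st.onG0:
		return "best effort: try unwinding curg instead of g0"
	default:
		return "normal Go traceback"
	}
}

func main() {
	fmt.Println(chooseTraceback(signalState{onG0: true}))
}
```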
I still think it is too strong to require that we can unwind the stack any time we receive an external signal.
I tend to agree that as a general feature getting correct tracebacks when an external SIGSEGV lands in an esoteric location is not important to users. I've been pushing on these bugs primarily to support testing. To ensure that the common cases work and don't regress, I want to be able to write a straightforward test that tracebacks work. It is very difficult to write such a test if there are lots of edge cases that don't work properly.
For now, I'll skip this part of the test on 386, and I've filed #50504 as a low-priority issue for this asmcgocall problem.
Change https://golang.org/cl/376656 mentions this issue: runtime: skip TestSegv traceback check on 386
The linux-mips64le-mengzhuo failures are still occurring on the builder. It's not obvious to me whether that has the same root cause as the 386 failure mode, so I filed it separately as #50605.
Change https://go.dev/cl/431975 mentions this issue: runtime: enable TestSegv on darwin, illumos, solaris
@golang/runtime, can this be re-triaged? The test still has a testenv.SkipFlaky call on darwin, illumos, and solaris referring to this issue; if there is some other tracking issue for the test (particularly given that darwin is a first-class platform), then the skip should be updated to refer to that issue.
Change https://go.dev/cl/491095 mentions this issue: runtime: add test skips for ios
Hello from triage. :) I think we didn't get to this because our incoming queue was quite busy. @prattmic will take a look at breaking it up. @cherrymui is working on a rewrite of the test that will split out the failures a bit better.
There are various TestSegv issues that watchflakes is tracking, e.g. #59443. I think we can close this as a dup. If this happens again, watchflakes will post on or reopen an existing issue, or open a new one.
@cherrymui, this issue is more about the followup work needed to diagnose the existing skips, particularly on darwin and on linux/386 (which are both first-class ports):
https://cs.opensource.google/go/go/+/master:src/runtime/crash_cgo_test.go;l=667-675;drc=6dd3bfbed6f17e7789f092e96408c00c227a8b68
That is: relying on watchflakes to report this issue is not appropriate, because the relevant failures are already being skipped.
I think we know we are not always able to traceback from asynchronous interrupts in libc calls. I'm not sure if there is anything we can do. Or do you mean we should push hard to make that work?
The linux/386 VDSO case may be fixable (or already fixed). #50504 should track that.
I think we know we are not always able to traceback from asynchronous interrupts in libc calls. I'm not sure if there is anything we can do. Or do you mean we should push hard to make that work?
I think that if we consider that to be normal operation, we should update the test to examine the output and confirm that it is consistent with what we would expect for that case (instead of calling testenv.SkipFlaky).
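A sketch of what such a relaxed check might look like (hypothetical helper names, written for this comment rather than taken from the actual test): always require the SIGSEGV banner, but only insist on a clean goroutine traceback when we expect the runtime to be able to produce one.

```go
// Sketch only; checkSegvOutput is a hypothetical helper standing in for the
// assertions TestSegv makes on the subprocess output.
package sketch

import (
	"strings"
	"testing"
)

// checkSegvOutput accepts the "signal landed in a libc call we cannot
// unwind" case as normal operation, while still requiring the SIGSEGV
// banner and rejecting an unwind that produced bogus frames.
func checkSegvOutput(t *testing.T, got string, wantTraceback bool) {
	t.Helper()
	if !strings.Contains(got, "SIGSEGV") {
		t.Errorf("output does not mention SIGSEGV:\n%s", got)
	}
	if !wantTraceback {
		// Asynchronous signal in a libc call: a traceback is best effort,
		// so only check that we did not print obviously bogus frames.
		if strings.Contains(got, "unexpected return pc") {
			t.Errorf("output contains bogus frames:\n%s", got)
		}
		return
	}
	if !strings.Contains(got, "goroutine ") {
		t.Errorf("output is missing a goroutine traceback:\n%s", got)
	}
}

func TestCheckSegvOutputSketch(t *testing.T) {
	checkSegvOutput(t, "SIGSEGV: segmentation violation\ngoroutine 1 [running]:\n", true)
	checkSegvOutput(t, "SIGSEGV: segmentation violation\n", false)
}
```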
Sounds good. Thanks.
@prattmic, possibly related to CL 339989 / #47522?
greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-01
2021-10-27T13:12:49-cfb5321/illumos-amd64
2021-10-27T08:50:27-bdefb77/illumos-amd64
2021-10-27T08:50:27-bdefb77/solaris-amd64-oraclerel
2021-10-27T05:33:58-ca5f65d/illumos-amd64
2021-10-26T23:12:17-13eccaa/illumos-amd64
2021-10-26T23:12:17-13eccaa/solaris-amd64-oraclerel
2021-10-26T23:06:10-e5c5125/illumos-amd64
2021-10-26T23:06:10-e5c5125/solaris-amd64-oraclerel
2021-10-26T22:05:53-80be4a4/illumos-amd64
2021-10-26T22:05:53-80be4a4/solaris-amd64-oraclerel
2021-10-26T21:32:57-86f6bf1/darwin-amd64-race
2021-10-26T21:32:57-86f6bf1/illumos-amd64