golang / go

The Go programming language
https://go.dev

runtime: frequent TestSegv failures since 2021-10-26 #49182

Open bcmills opened 3 years ago

bcmills commented 3 years ago

@prattmic, possibly related to CL 339989 / #47522?

greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-01

2021-10-27T13:12:49-cfb5321/illumos-amd64
2021-10-27T08:50:27-bdefb77/illumos-amd64
2021-10-27T08:50:27-bdefb77/solaris-amd64-oraclerel
2021-10-27T05:33:58-ca5f65d/illumos-amd64
2021-10-26T23:12:17-13eccaa/illumos-amd64
2021-10-26T23:12:17-13eccaa/solaris-amd64-oraclerel
2021-10-26T23:06:10-e5c5125/illumos-amd64
2021-10-26T23:06:10-e5c5125/solaris-amd64-oraclerel
2021-10-26T22:05:53-80be4a4/illumos-amd64
2021-10-26T22:05:53-80be4a4/solaris-amd64-oraclerel
2021-10-26T21:32:57-86f6bf1/darwin-amd64-race
2021-10-26T21:32:57-86f6bf1/illumos-amd64

cuonglm commented 3 years ago

I think these failures started with https://go-review.googlesource.com/c/go/+/339990.

CL https://go-review.googlesource.com/c/go/+/339989 causes the windows-arm64-10 builder to fail.

bcmills commented 3 years ago

Filed the windows-arm64-10 issue separately as #49188.

gopherbot commented 3 years ago

Change https://golang.org/cl/359254 mentions this issue: runtime: disable TestSegv on darwin, illumos, solaris

bcmills commented 3 years ago

Hmm, looks like relaxing the check didn't completely solve the problem:

greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-28T16:54:00

2021-10-28T16:54:58-6bd0e7f/linux-arm-aws

bcmills commented 3 years ago

linux-arm-aws is the only one that appears to still be flaky:

greplogs --dashboard -md -l -e '(?ms)TestSegv.*unexpectedly saw "runtime: "'

2021-11-02T20:47:30-b246873/linux-arm-aws
2021-11-02T17:01:14-62b29b0/linux-arm-aws
2021-10-28T16:54:58-6bd0e7f/linux-arm-aws

aclements commented 3 years ago

This looks like the signal is landing while we're in the VDSO, presumably executing a kernel-provided atomic for casgstatus.

@prattmic, you looked at and fixed a few similar issues recently (e.g., 86f6bf18b013d570e89f57c2decaddca5ce2a847). Any insights on this one?

aclements commented 3 years ago

Yeah, the "unknown PC" is in the kernel-provided CAS implementation. The problem may just be that we're entirely missing the vdsoSP protection around the VDSO cas call (presumably the same thing applies to memory_barrier<>, too).
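
(For context, the vdsoSP protection works roughly like this: before a Go thread enters a VDSO or kernel-helper call, the runtime records the caller's PC and SP in m.vdsoPC / m.vdsoSP, and the signal handler's traceback falls back to those values when the faulting PC isn't recognizable Go code. The real bracketing is done in assembly at each call site; the Go-flavored sketch below is only illustrative -- the helper functions are made up, though the vdsoPC/vdsoSP fields are real.)

    // Minimal stand-in for the runtime's per-thread m struct; only the
    // two fields relevant here (runtime.m really does have vdsoPC/vdsoSP).
    type mSketch struct {
        vdsoPC uintptr // PC of the Go caller of the VDSO call, 0 if none
        vdsoSP uintptr // SP of the Go caller of the VDSO call, 0 if none
    }

    // vdsoCallSketch is hypothetical; the real bracketing lives in assembly
    // (e.g. sys_linux_arm.s) around each VDSO / kernel-helper call site.
    func vdsoCallSketch(mp *mSketch, callerPC, callerSP uintptr, call func()) {
        mp.vdsoPC = callerPC // remember where a traceback should restart
        mp.vdsoSP = callerSP
        call()        // the actual VDSO or __kuser_cmpxchg call
        mp.vdsoSP = 0 // zero means "not currently in a VDSO call"
    }

    // In the signal handler's traceback, the saved values act as a
    // fallback when the faulting PC is not in Go code.
    func tracebackStartSketch(mp *mSketch, sigPC, sigSP uintptr) (pc, sp uintptr) {
        if mp.vdsoSP != 0 {
            return mp.vdsoPC, mp.vdsoSP // unwind from the Go caller instead
        }
        return sigPC, sigSP
    }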

prattmic commented 3 years ago

Yes, I believe that is the case. I'll get those covered as well.

gopherbot commented 2 years ago

Change https://golang.org/cl/362796 mentions this issue: runtime/internal/atomic: treat ARM kernel helpers as VDSO

gopherbot commented 2 years ago

Change https://golang.org/cl/362795 mentions this issue: runtime: refactor ARM VDSO call setup to helper

gopherbot commented 2 years ago

Change https://golang.org/cl/362977 mentions this issue: runtime: start ARM atomic kernel helper traceback in caller

jeremyfaller commented 2 years ago

Mostly fixed. Not a Beta1 blocker as it's not a new breakage, just a stricter test.

bcmills commented 2 years ago

Looks like this is fixed on ARM and skipped on amd64, but still occasionally failing on some of the more exotic architectures. (It's not clear to me whether the remaining failures are arch-specific bugs.)

greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-13

2021-11-17T04:55:12-1d004fa/linux-mips64le-mengzhuo
2021-11-13T00:23:16-39bc666/linux-riscv64-unmatched

bcmills commented 2 years ago

Looks like both of those failures are in the SegvInCgo subtest.

bcmills commented 2 years ago

https://storage.googleapis.com/go-build-log/3c4f5e79/linux-riscv64-jsing_db2bc678.log (a linux-riscv64-jsing SlowBot) also failed in SegvInCgo.

bcmills commented 2 years ago

greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-18

2021-12-02T20:45:47-3c6295d/netbsd-amd64-9_0-n2d
2021-12-02T16:48:07-36be0be/netbsd-amd64-9_0-n2d
2021-11-24T21:11:45-b77f5f9/linux-mips64le-mengzhuo
2021-11-24T20:59:14-67dd9ee/linux-s390x-ibm
2021-11-24T20:53:29-003e7fa/linux-riscv64-jsing
2021-11-24T05:36:54-47db3bb/openbsd-amd64-70
2021-11-23T19:31:05-0f64c21/linux-riscv64-unmatched
2021-11-22T03:24:07-a287c4a/openbsd-amd64-70
2021-11-18T02:17:23-ce7e501/linux-arm64-packet

prattmic commented 2 years ago

The netbsd and openbsd failures there appear to be #49209.

mknyszek commented 2 years ago

Excluding the NetBSD and OpenBSD failures, all the ones that are left are in TestSegvInCgo like the two @bcmills posted about before. I don't think this is the same problem as the one @prattmic fixed earlier, because even an arm64 builder is failing there.

I'll poke at this one a bit.

mknyszek commented 2 years ago

In all these cases it looks like the signal is landing in some part of the cgocall path (not all that surprising), though I'm not sure I understand why gentraceback has issues producing a traceback in these cases.

mknyszek commented 2 years ago

Oh, OK: actually, for the linux/arm64 failure, the PC looks like the kind of PC the branch @prattmic added in https://go.dev/cl/362977 checks for (i.e. the top bits of the PC, in a 48-bit address space, are 0xffff). So there's some other kind of VDSO (?) call on at least this platform where more is going on.

The other platforms' failures don't seem to look like this, however.

prattmic commented 2 years ago

The arm64 failure could occur even with 363459479014bde19e83d9fb6781310f63fb0b45 if one of the kernel helpers made a CALL (thus setting LR). I thought I verified that none of the helpers we used did this, but I guess I should go back and double-check on the specific kernel version on the builders.

Beyond this, it looks like we have failures on linux-mips64le, linux-riscv64, and linux-s390x.

mknyszek commented 2 years ago

I'll note that 3634594 only applies to 32-bit ARM, AFAICT, so it shouldn't be taking that path. It could still be that we have some VDSO call that makes frames.

The RISC-V failures are kind of interesting because there's no hexdump of the stack -- I'm not really sure why. Notably, in cgoSigtramp the LR is 0 (cgoSigtramp is a thin wrapper around sigtramp that just jumps and links a zero LR).

prattmic commented 2 years ago

2021-12-08T23:00:01-f5ddd94/linux-386
2021-12-04T04:41:31-cd5f2cf/linux-mips64le-mengzhuo

prattmic commented 2 years ago

In the linux-386 failure, PC=0xf7f3db09 seems likely to be in the VDSO.

The VDSO load address is randomized, but this seems to be in approximately the right range. Additionally, page offset 0xb09 is a valid instruction just after a syscall, which seems like a likely place for a signal to be delivered.

   0xf7fd1b00 <__kernel_vsyscall>:      push   %ecx
   0xf7fd1b01 <__kernel_vsyscall+1>:    push   %edx
   0xf7fd1b02 <__kernel_vsyscall+2>:    push   %ebp
   0xf7fd1b03 <__kernel_vsyscall+3>:    mov    %esp,%ebp
   0xf7fd1b05 <__kernel_vsyscall+5>:    sysenter 
   0xf7fd1b07 <__kernel_vsyscall+7>:    int    $0x80
   0xf7fd1b09 <__kernel_vsyscall+9>:    pop    %ebp
   0xf7fd1b0a <__kernel_vsyscall+10>:   pop    %edx
   0xf7fd1b0b <__kernel_vsyscall+11>:   pop    %ecx
   0xf7fd1b0c <__kernel_vsyscall+12>:   ret  

Go doesn't use __kernel_vsyscall directly, but libc does. Maybe we get there somehow in a context where the signal handler doesn't realize this is cgo?

Edit: note that the assembly dump above is from an unrelated runtime.test process that did not fail.

prattmic commented 2 years ago

e.g., __kernel_vsyscall is reachable from cgo_start_thread, which is called via asmcgocall.

It looks like the signal handler won't properly detect raw asmcgocall calls as cgo calls because sighandler checks g.m.incgo, which is only set in cgocall.
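
To make the gap concrete, here is a simplified sketch (not the actual runtime source, which lives in runtime/cgocall.go; the stand-in type and function names are made up): cgocall sets m.incgo around its asmcgocall so sighandler knows the thread is running C code, but a direct asmcgocall never touches that flag.

    // mStub is a stand-in for the runtime's m; the real field is m.incgo,
    // which sighandler consults when deciding how to unwind.
    type mStub struct {
        incgo bool
    }

    // The normal cgo path: cgocall marks the M as "in C" before handing
    // off to asmcgocall and clears the flag afterwards.
    func cgocallSketch(mp *mStub, call func()) {
        mp.incgo = true
        call() // stands in for asmcgocall(fn, arg)
        mp.incgo = false
    }

    // A raw asmcgocall made directly (e.g. the cgo_start_thread call
    // above) bypasses this wrapper, so incgo stays false and sighandler
    // assumes the faulting PC is Go code it should be able to unwind.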

@cherrymui does this sound correct to you, or did I miss some detection mechanism?

cherrymui commented 2 years ago

@prattmic I think that is right. Direct asmcgocall is used in limited situations, but I guess it is still possible to receive an external signal.

VDSO calls made from Go would have m.vdsoPC and m.vdsoSP set, so we can tell. But if it is made from non-Go code then we can't tell.

cherrymui commented 2 years ago

I still think it is too strong to require that we can unwind the stack any time we receive an external signal. I think it is important to print SIGSEGV, but the unwinding could be best-effort. Maybe we could try unwinding the curg stack if the signal lands on g0 (even if incgo is not set), so the output is at least somewhat helpful.
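
A very rough pseudocode sketch of that idea (purely illustrative; none of these types or helpers exist in the runtime):

    // Made-up types standing in for runtime.g and runtime.m.
    type gStub struct{ m *mLink }
    type mLink struct{ g0, curg *gStub }

    // handleExternalSegvSketch shows the suggested fallback: always print
    // the SIGSEGV header, and if the signal landed on the system stack
    // (g0) and the faulting PC can't be unwound, dump the user
    // goroutine's stack as a best effort instead of printing nothing.
    func handleExternalSegvSketch(gp *gStub, sigPC uintptr,
        canUnwind func(uintptr) bool, traceback func(*gStub)) {
        println("SIGSEGV: segmentation violation")
        if gp == gp.m.g0 && !canUnwind(sigPC) {
            traceback(gp.m.curg) // best effort: curg's last known frames
            return
        }
        traceback(gp) // normal case: unwind from where the signal landed
    }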

prattmic commented 2 years ago

I still think it is too strong to require that we can unwind the stack any time we receive an external signal.

I tend to agree that as a general feature getting correct tracebacks when an external SIGSEGV lands in an esoteric location is not important to users. I've been pushing on these bugs primarily to support testing. To ensure that the common cases work and don't regress, I want to be able to write a straightforward test that tracebacks work. It is very difficult to write such a test if there are lots of edge cases that don't work properly.

For now, I'll skip this part of the test on 386, and I've filed #50504 as a low priority issue for this asmcgocall problem.

gopherbot commented 2 years ago

Change https://golang.org/cl/376656 mentions this issue: runtime: skip TestSegv traceback check on 386

bcmills commented 2 years ago

The linux-mips64le-mengzhuo failures are still occurring on the builder. It's not obvious to me whether that has the same root cause as the 386 failure mode, so I filed it separately as #50605.

gopherbot commented 2 years ago

Change https://go.dev/cl/431975 mentions this issue: runtime: enable TestSegv on darwin, illumos, solaris

bcmills commented 1 year ago

@golang/runtime, can this be re-triaged? The test still has a testenv.SkipFlaky call on darwin, illumos, and solaris referring to this issue — if there is some other tracking issue for the test (particularly given that darwin is a first-class platform), then the skip should be updated to refer to that issue.

gopherbot commented 1 year ago

Change https://go.dev/cl/491095 mentions this issue: runtime: add test skips for ios

mknyszek commented 1 year ago

Hello from triage. :) I think we didn't get to this because our incoming queue was quite busy. @prattmic will take a look at breaking it up. @cherrymui is working on a rewrite of the test that will split out the failures a bit better.

cherrymui commented 1 year ago

There are various TestSegv issues that watchflakes is tracking, e.g. #59443. I think we can close this as a dup. If this happens again, watchflakes will post on or reopen an existing issue, or open a new one.

bcmills commented 1 year ago

@cherrymui, this issue is more about the followup work needed to diagnose the existing skips, particularly on darwin and on linux/386 (which are both first-class ports): https://cs.opensource.google/go/go/+/master:src/runtime/crash_cgo_test.go;l=667-675;drc=6dd3bfbed6f17e7789f092e96408c00c227a8b68

bcmills commented 1 year ago

That is: relying on watchflakes to report this issue is not appropriate because the relevant failures are already being skipped.

cherrymui commented 1 year ago

I think we know we are not always able to traceback from asynchronous signals in libc calls. I'm not sure if there is anything we can do. Or do you mean we should push hard to make that work?

The linux/386 VDSO case may be fixable (or already fixed). #50504 should track that.

bcmills commented 1 year ago

I think we know we are not always able to traceback from asynchronous signals in libc calls. I'm not sure if there is anything we can do. Or do you mean we should push hard to make that work?

I think that if we consider that to be normal operation, we should update the test to examine the output and confirm that it is consistent with what we would expect for that case (instead of calling testenv.SkipFlaky).
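
For instance (an illustrative fragment only, not a proposed patch; it assumes the test's existing crash output in got, and nonGoPC is a made-up helper that classifies the unknown PC), the check could look something like:

    // Always require the SIGSEGV header.
    if !strings.Contains(got, "SIGSEGV") {
        t.Errorf("output does not mention SIGSEGV:\n%s", got)
    }
    // Tolerate "runtime: unknown pc" only when the PC plausibly points
    // into libc or the VDSO, i.e. the case we consider normal operation.
    if strings.Contains(got, "runtime: unknown pc") && !nonGoPC(got) {
        t.Errorf("traceback stopped at an unexpected PC:\n%s", got)
    }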

cherrymui commented 1 year ago

Sounds good. Thanks.