Closed rhysh closed 2 years ago
This looks like a problem with linux/riscv64 signal handling, not specific to the cpu profiler: the tests for the os/signal package see similar crashes (though because of the test structure, it's in a subprocess rather than the test runner itself).
$ GOOS=linux GOARCH=riscv64 go test -c -o /tmp/signal.go1.17.3.test os/signal
$ go version /tmp/signal.go1.17.3.test
/tmp/signal.go1.17.3.test: go1.17.3
$ # scp and ssh to the linux/riscv64 host
sipeed@sipeed:~$ nohup ./stress ./signal.go1.17.3.test -test.short >> ./nohup.log &
sipeed@sipeed:~$ tail -n 1 ./nohup.log
3h18m50s: 2809 runs so far, 3 failures (0.11%)
One of the failures is fatal error: unexpected signal during runtime execution
. Two of them are fatal error: runtime: split stack overflow
.
kindly cc @cherrymui @4a6f656c @prattmic
Yeah, it does look like signal related. Is it reproducible at tip? Thanks.
Yes, the runtime/pprof crashes reproduce at tip. It's very fast, I got 10 failures in under 4 minutes. The crashes in os/signal are slower, less than one per hour on go1.17.3. I'll leave it running.
$ ~/gotip/bin/go version
go version devel go1.18-cd0bf38966 Mon Nov 22 16:53:57 2021 +0000 darwin/amd64
$ GOOS=linux GOARCH=riscv64 ~/gotip/bin/go test -c -o /tmp/signal.go1.18-devel-cd0bf38966.test os/signal
$ GOOS=linux GOARCH=riscv64 ~/gotip/bin/go test -c -o /tmp/pprof.go1.18-devel-cd0bf38966.test runtime/pprof
$ go version /tmp/*devel*.test
/tmp/pprof.go1.18-devel-cd0bf38966.test: devel go1.18-cd0bf38966 Mon Nov 22 16:53:57 2021 +0000
/tmp/signal.go1.18-devel-cd0bf38966.test: devel go1.18-cd0bf38966 Mon Nov 22 16:53:57 2021 +0000
# scp, ssh
sipeed@sipeed:~$ nohup ./stress ./pprof.go1.18-devel-cd0bf38966.test -test.run=TestCPUProfileMultithreaded -test.short > ./nohup.pprof.cd0bf38966.log &
sipeed@sipeed:~$ tac ./nohup.pprof.cd0bf38966.log | grep failures | head -n 1
3m15s: 192 runs so far, 10 failures (5.21%)
sipeed@sipeed:~$ (for f in /tmp/go-stress-20211122T181250-* ; do grep fatal $f | head -n 1 ; done) | sort | uniq -c | head
8 fatal error: unexpected signal during runtime execution
2 fatal error: unknown caller pc
sipeed@sipeed:~$ nohup ./stress ./signal.go1.18-devel-cd0bf38966.test -test.short > ./nohup.signal.cd0bf38966.log &
sipeed@sipeed:~$ tac ./nohup.signal.cd0bf38966.log | grep failures | head -n 1
9m35s: 136 runs so far, 0 failures
I haven't seen any os/signal failures yet at tip, and with the go1.17.3 failure rate I'd expect "about 6" by now:
sipeed@sipeed:~$ tac ./nohup.signal.cd0bf38966.log | grep failures | head -n 1
6h39m45s: 5692 runs so far, 0 failures
The os/signal failures reproduce at tip, but very slowly (10 hours):
sipeed@sipeed:~$ tac ./nohup.signal.cd0bf38966.log | grep failures | head -n 1
21h21m25s: 18251 runs so far, 2 failures (0.01%)
sipeed@sipeed:~$ (for f in /tmp/go-stress-20211122T181657-* ; do grep fatal $f | head -n 1 ; done) | sort | uniq -c | head
1 fatal error: runtime: split stack overflow
1 fatal error: unexpected signal during runtime execution
Yes, the runtime/pprof crashes reproduce at tip. It's very fast, I got 10 failures in under 4 minutes. The crashes in os/signal are slower, less than one per hour on go1.17.3. I'll leave it running.
I'm not able to reproduce this here on a SiFive HiFive Unleashed:
$ ../bin/go version
go version devel go1.18-00045b76e5 Tue Nov 23 17:30:32 2021 +0000 linux/riscv64
$ ../bin/go test -c -o /tmp/pprof.test runtime/pprof
$ while /tmp/pprof.test -test.run=TestCPUProfileMultithreaded -test.short ; do : ; done
That ran for several hours without failing - @mengzhuo can you test on your SiFive HiFive Unmatched?
@4a6f656c Joel, I ran this test for about 1 hour and no failure occurred. I notice that the CPU of Nezha is only one core. Should we disable the multi-thread tests for single core CPU?
some updates, It runs 18 hours and no failures.
Yes, the runtime/pprof crashes reproduce at tip. It's very fast, I got 10 failures in under 4 minutes. The crashes in os/signal are slower, less than one per hour on go1.17.3. I'll leave it running.
I'm not able to reproduce this here on a SiFive HiFive Unleashed:
$ ../bin/go version go version devel go1.18-00045b76e5 Tue Nov 23 17:30:32 2021 +0000 linux/riscv64 $ ../bin/go test -c -o /tmp/pprof.test runtime/pprof $ while /tmp/pprof.test -test.run=TestCPUProfileMultithreaded -test.short ; do : ; done
That ran for several hours without failing - @mengzhuo can you test on your SiFive HiFive Unmatched?
Should we disable the multi-thread tests for single core CPU?
It's a crash, not a test failure, so I don't think that disabling the tests is the right thing to do.
The TestCPUProfileMultithreaded
case sets GOMAXPROCS to 2. On multi-core machines, does the crash reproduce when taskset
or similar restricts all threads to run on a single core?
I ran the stress test for several hours with async preemption disabled and saw zero failures:
sipeed@sipeed:~$ nohup env GODEBUG=asyncpreemptoff=1 ./stress ./pprof.go1.18-devel-cd0bf38966.test -test.run='^TestCPUProfileMultithreaded$' -test.short > ./nohup.pprof.cd0bf38966.asyncpreemptoff.log &
# wait a while
sipeed@sipeed:~$ tac ./nohup.pprof.cd0bf38966.asyncpreemptoff.log | grep failures | head -n 1
5h23m0s: 15074 runs so far, 0 failures
Some of the failures I see, which complain of fatal error: unexpected signal during runtime execution
, happen soon after a SIGPROF signal arrives on a thread that is already running on the gsignal stack to handle SIGURG, meaning that inside the sighandler function, gp == _g_.m.gsignal
.
The other two failure modes, fatal error: runtime: split stack overflow
and fatal error: unknown caller pc
, seem like they could also be caused by stack-sharing: one execution using the stack from where the other left off leading to larger-than-expected stack use, or one clobbering the return values that the other had saved.
What, if anything, would make it safe for the signal handlers to be re-entrant, and/or to share the M's gsignal stack? I haven't seen these failures on linux/amd64 (though I don't typically use single-core machines).
@rhysh thanks for the logs! If the signal handler is reentrant, or the signal stack is somehow shared, I think it is very wrong. Could you print the stack pointer at entry of signal handler to see if it is running on the same stack?
gp == _g_.m.gsignal
I think that condition should never be true. This could either due to that we're getting nested signals or the G is somehow wrong. As we got signal PC in runtime.pcdatavalue2
, which is only called from the signal handler, it indicates that we're actually getting nested signals, which should not happen as we block signals when running the signal handler.
What kernel version are you running on? @4a6f656c and @mengzhuo , what kernel version does the builder running?
Yes, the variation in the result of getcallersp
makes it looks like nested signals. I'm running the OS installation that came with the machine, kernel 5.4.61
.
sipeed@sipeed:~$ uname -a
Linux sipeed 5.4.61 #15 PREEMPT Thu Jun 10 06:50:03 UTC 2021 riscv64 GNU/Linux
In a crash from "fatal error: unexpected signal during runtime execution", it looks like M0 processes SIGURG and SIGPROF always with SP 0x3fa80098d8. On M3, it usually processes those with SP 0x3fa805b8d8. But then it gets a SIGPROF/27 on PC 0x16c34 (runtime.(*debugLogWriter).varint
, hmm) and the SP for that call to runtime.sighandler is 0x3fa805b0c8, a 0x810 or 2064-byte delta from the usual place.
We don't see a log line from a competing SIGURG/23 delivery, but we do see that the SIGPROF/27 signal came when the PC was inside a function that only the signal handler would call. And we see a split in the debuglog output (">> begin log 0 <<"), which seems to mean that something else (the signal handler) was using the debuglog that's usually in the global pool (between calls to dlog
and end
) which meant the SIGPROF/27 handler needed to allocate another buffer.
sipeed@sipeed:~$ cat /tmp/go-stress-20211129T195813-1450030122 | grep 'sp 0x' | cut -d' ' -f 5-10 | sort | uniq -c
9 m 0 sp 0x3fa80098d8 sig 23
7 m 0 sp 0x3fa80098d8 sig 27
1 m 3 sp 0x3fa805a378 sig 27
1 m 3 sp 0x3fa805aa48 sig 11
1 m 3 sp 0x3fa805b0c8 sig 11
1 m 3 sp 0x3fa805b0c8 sig 27
7 m 3 sp 0x3fa805b8d8 sig 23
7 m 3 sp 0x3fa805b8d8 sig 27
On linux/amd64, adding strace -f -ttt -e rt_sigaction
shows lines that set sa_mask=~[]
, which I take to mean "all bits set".
On linux/riscv64, the strace output shows sa_mask=[]
, which I take to mean "all bits cleared". When I modify runtime.setsig to set the sa_restorer
field, I see it reflected in strace's view of the sa_mask
field.
It looks from this like the definition of runtime.sigactiont
is not correct on linux/riscv64, resulting in no signals at all being blocked when handling a signal.
Good finding. From the kernel C header
struct sigaction {
__sighandler_t sa_handler;
unsigned long sa_flags;
#ifdef SA_RESTORER
__sigrestore_t sa_restorer;
#endif
sigset_t sa_mask; /* mask last for extensibility */
};
So it does look wrong if SA_RESTORER is not defined. And it does look like not defined on riscv64.
Change https://golang.org/cl/367635 mentions this issue: runtime: fix riscv64 sigaction mask field offset
What version of Go are you using (
go version
)?(Cross-compiling to linux/riscv64)
Does this issue reproduce with the latest release?
Yes, this problem appears with go1.17.3, the latest stable release.
What operating system and processor architecture are you using (
go env
)?I only have a working Go installation on my darwin/amd64 machine (below). I'm cross-compiling from there to linux/riscv64, where I see the problem.
go env
OutputWhat did you do?
I've got a small RISCV64 computer: Nezha with the Allwinner D1 SOC. I'm running the runtime/pprof tests at version go1.17.3 and seeing several types of "fatal error" failures.
What did you expect to see?
Near-zero failure rate of runtime/pprof's TestCPUProfileMultithreaded test in short mode when using go1.17.3, and zero crashes.
What did you see instead?
Currently about 1.5% failure rate, all of which are crashes ("fatal error"). (On the plus side, I haven't seen any test failures for that test.)
In the last 42 minutes, the stress run has collected:
Here's an example of each:
"fatal error: unexpected signal during runtime execution" / "[signal SIGSEGV: segmentation violation code=0x1 addr=0x2 pc=0x2]"
``` sipeed@sipeed:~$ cat /tmp/go-stress-20211121T161735-4177513427 fatal error: unexpected signal during runtime execution [signal SIGSEGV: segmentation violation code=0x1 addr=0x2 pc=0x2] goroutine 0 [idle]: runtime: unexpected return pc for runtime.sigpanic called from 0x2 stack: frame={sp:0x3fb005b730, fp:0x3fb005b760} stack=[0x3fb0054000,0x3fb005c000) 0x0000003fb005b630: 0x0000000000000000 0x000000000004cb6c"runtime: unexpected return pc for runtime.sigpanic called from 0x0" / "fatal error: unknown caller pc"
``` sipeed@sipeed:~$ cat /tmp/go-stress-20211121T161735-4026159646 runtime: unexpected return pc for runtime.sigpanic called from 0x0 stack: frame={sp:0x3fb405b910, fp:0x3fb405b940} stack=[0x3fb402e800,0x3fb402f000) fatal error: unknown caller pc runtime stack: runtime.throw({0x1aa247, 0x11}) /usr/local/go/src/runtime/panic.go:1198 +0x60 runtime.gentraceback(0x49820, 0x3fb405b8f8, 0x0, 0x3fb40ba820, 0x0, 0x0, 0x7fffffff, 0x3fb403bfc0, 0x0, 0x0) /usr/local/go/src/runtime/traceback.go:274 +0x17c0 runtime.addOneOpenDeferFrame.func1() /usr/local/go/src/runtime/panic.go:751 +0x88 runtime.systemstack() /usr/local/go/src/runtime/asm_riscv64.s:131 +0x58 goroutine 8 [running]: runtime.systemstack_switch() /usr/local/go/src/runtime/asm_riscv64.s:94 +0x8 fp=0x3fb405b7f0 sp=0x3fb405b7e8 pc=0x80768 runtime.addOneOpenDeferFrame(0x3fb40ba820, 0x49820, 0x3fb405b8f8) /usr/local/go/src/runtime/panic.go:750 +0x80 fp=0x3fb405b828 sp=0x3fb405b7f0 pc=0x4b160 panic({0x1871a0, 0x2c5ab0}) /usr/local/go/src/runtime/panic.go:998 +0xcc fp=0x3fb405b8f8 sp=0x3fb405b828 pc=0x4c09c runtime.panicmem() /usr/local/go/src/runtime/panic.go:221 +0x60 fp=0x3fb405b910 sp=0x3fb405b8f8 pc=0x49820 runtime: unexpected return pc for runtime.sigpanic called from 0x0 stack: frame={sp:0x3fb405b910, fp:0x3fb405b940} stack=[0x3fb402e800,0x3fb402f000) runtime.sigpanic() /usr/local/go/src/runtime/signal_unix.go:735 +0x264 fp=0x3fb405b940 sp=0x3fb405b910 pc=0x66a04 created by runtime/pprof.TestCPUProfileMultithreaded.func1 /usr/local/go/src/runtime/pprof/pprof_test.go:100 +0x64 goroutine 1 [chan receive]: testing.(*T).Run(0x3fb40ba1a0, {0x1acff7, 0x1b}, 0x1b7750) /usr/local/go/src/testing/testing.go:1307 +0x3a8 testing.runTests.func1(0x3fb40ba1a0) /usr/local/go/src/testing/testing.go:1598 +0x84 testing.tRunner(0x3fb40ba1a0, 0x3fb409dd50) /usr/local/go/src/testing/testing.go:1259 +0xf8 testing.runTests(0x3fb400a030, {0x2cb500, 0x1e, 0x1e}, {0x0, 0x0, 0x0}) /usr/local/go/src/testing/testing.go:1596 +0x434 testing.(*M).Run(0x3fb40a4680) /usr/local/go/src/testing/testing.go:1504 +0x574 main.main() _testmain.go:101 +0x174 goroutine 6 [runnable]: runtime/pprof.cpuHog2(0x0) /usr/local/go/src/runtime/pprof/pprof_test.go:71 +0x1c runtime/pprof.cpuHogger(0x1b7940, 0x2fc268, 0x5f5e100) /usr/local/go/src/runtime/pprof/pprof_test.go:40 +0x5c runtime/pprof.TestCPUProfileMultithreaded.func1(0x5f5e100) /usr/local/go/src/runtime/pprof/pprof_test.go:104 +0x8c runtime/pprof.testCPUProfile(0x3fb40ba340, 0x1b79b8, {0x3fb4062f58, 0x2, 0x2}, {0x0, 0x0, 0x0}, 0x1b7748) /usr/local/go/src/runtime/pprof/pprof_test.go:324 +0x108 runtime/pprof.TestCPUProfileMultithreaded(0x3fb40ba340) /usr/local/go/src/runtime/pprof/pprof_test.go:98 +0xd0 testing.tRunner(0x3fb40ba340, 0x1b7750) /usr/local/go/src/testing/testing.go:1259 +0xf8 created by testing.(*T).Run /usr/local/go/src/testing/testing.go:1306 +0x388 goroutine 7 [sleep]: time.Sleep(0x5f5e100) /usr/local/go/src/runtime/time.go:193 +0x16c runtime/pprof.profileWriter({0x1e0400, 0x3fb400e8a0}) /usr/local/go/src/runtime/pprof/pprof.go:799 +0x68 created by runtime/pprof.StartCPUProfile /usr/local/go/src/runtime/pprof/pprof.go:784 +0x190 ERROR: exit status 2 ```"fatal error: runtime: split stack overflow"
``` sipeed@sipeed:~$ cat /tmp/go-stress-20211121T161735-2006060682 runtime: newstack sp=0x3fa80098d0 stack=[0x3fa8062000, 0x3fa8063000] morebuf={pc:0x0 sp:0x3fa80098d0 lr:0x0} sched={pc:0x667b0 sp:0x3fa80098d0 lr:0x0 ctxt:0x0} runtime: gp=0x3fa80bc340, goid=6, gp->status=0x2 runtime: split stack overflow: 0x3fa80098d0 < 0x3fa8062000 fatal error: runtime: split stack overflow runtime stack: runtime.throw({0x1add89, 0x1d}) /usr/local/go/src/runtime/panic.go:1198 +0x60 runtime.newstack() /usr/local/go/src/runtime/stack.go:1032 +0x850 runtime.morestack() /usr/local/go/src/runtime/asm_riscv64.s:202 +0x70 goroutine 6 [running]: runtime: unexpected return pc for runtime.sigpanic called from 0x0 stack: frame={sp:0x3fa80098d0, fp:0x3fa80098d0} stack=[0x3fa8062000,0x3fa8063000) runtime.sigpanic() /usr/local/go/src/runtime/signal_unix.go:716 +0x10 fp=0x3fa80098d0 sp=0x3fa80098d0 pc=0x667b0 created by testing.(*T).Run /usr/local/go/src/testing/testing.go:1306 +0x388 goroutine 1 [chan receive]: testing.(*T).Run(0x3fa80bc000, {0x1acff7, 0x1b}, 0x1b7750) /usr/local/go/src/testing/testing.go:1307 +0x3a8 testing.runTests.func1(0x3fa80bc000) /usr/local/go/src/testing/testing.go:1598 +0x84 testing.tRunner(0x3fa80bc000, 0x3fa809dd50) /usr/local/go/src/testing/testing.go:1259 +0xf8 testing.runTests(0x3fa800a030, {0x2cb500, 0x1e, 0x1e}, {0x0, 0x0, 0x0}) /usr/local/go/src/testing/testing.go:1596 +0x434 testing.(*M).Run(0x3fa80a4680) /usr/local/go/src/testing/testing.go:1504 +0x574 main.main() _testmain.go:101 +0x174 goroutine 7 [sleep]: time.Sleep(0x5f5e100) /usr/local/go/src/runtime/time.go:193 +0x16c runtime/pprof.profileWriter({0x1e0400, 0x3fa800e8a0}) /usr/local/go/src/runtime/pprof/pprof.go:799 +0x68 created by runtime/pprof.StartCPUProfile /usr/local/go/src/runtime/pprof/pprof.go:784 +0x190 goroutine 8 [runnable]: runtime/pprof.cpuHog0(...) /usr/local/go/src/runtime/pprof/pprof_test.go:59 runtime/pprof.cpuHog1(0x0) /usr/local/go/src/runtime/pprof/pprof_test.go:54 +0xc runtime/pprof.cpuHogger(0x1b7938, 0x2fc260, 0x5f5e100) /usr/local/go/src/runtime/pprof/pprof_test.go:40 +0x5c runtime/pprof.TestCPUProfileMultithreaded.func1.1(0x5f5e100, 0x3fa805c3c0) /usr/local/go/src/runtime/pprof/pprof_test.go:101 +0x44 created by runtime/pprof.TestCPUProfileMultithreaded.func1 /usr/local/go/src/runtime/pprof/pprof_test.go:100 +0x64 ERROR: exit status 2 ```