golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: SIGSEGV in preemptone (riscv64) #68862

Open gopherbot opened 1 month ago

gopherbot commented 1 month ago
#!watchflakes
default <- goarch == "riscv64" && builder == "linux-riscv64-mengzhuo" && `sigcode=1 addr=0xc0`  

Original flakes

#!watchflakes
default <- pkg == "golang.org/x/tools/internal/imports" && test == "TestModReplace2"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestModReplace2
SIGSEGV: segmentation violation
PC=0x54520 m=12 sigcode=1 addr=0xc0

goroutine 0 gp=0x3f504421c0 m=12 mp=0x3f50526708 [idle]:
runtime.preemptone(0x3f504421c0?)
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:6297 +0x38 fp=0x3f5043ff28 sp=0x3f5043ff10 pc=0x54520
runtime.preemptall()
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:6275 +0x60 fp=0x3f5043ff50 sp=0x3f5043ff28 pc=0x544c0
runtime.forEachPInternal(0x2fa878)
...
a3  0x223abf02  a4  0x3f98bb8000
a5  0x31a7d99   a6  0x29ab75fd
a7  0x1187f0    s2  0x3f5043fed0
s3  0x3f50526708    s4  0x3f50474000
s5  0x3f50241500    s6  0xffffffff
s7  0x4 s8  0x3f50038688
s9  0x3f5043fdc8    s10 0x2fa878
s11 0x3f504421c0    t3  0x2eb2a46908caf
t4  0xffffffffffffffff  t5  0x1913e15049b3
t6  0x3f50038408    pc  0x54520
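
For triage, the header line of such a log can be pulled apart mechanically. The following is a minimal sketch, not part of watchflakes or the Go runtime: the type `crashHeader` and the functions `parseCrashHeader` and `looksLikeNilDeref` are hypothetical names introduced here. On Linux, `sigcode=1` for SIGSEGV is `SEGV_MAPERR` (address not mapped). Treating a fault address below one page as "nil pointer plus field offset" is only a common heuristic and is an assumption, not something this issue confirms.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// crashHeader holds the fields of a Go runtime signal header line such as
// "PC=0x54520 m=12 sigcode=1 addr=0xc0". (Hypothetical helper type.)
type crashHeader struct {
	PC      uint64
	M       int
	Sigcode int
	Addr    uint64
}

var headerRE = regexp.MustCompile(`PC=0x([0-9a-f]+) m=(\d+) sigcode=(\d+) addr=0x([0-9a-f]+)`)

// parseCrashHeader extracts the first header line found in log.
// ok is false when no such line is present.
func parseCrashHeader(log string) (h crashHeader, ok bool) {
	m := headerRE.FindStringSubmatch(log)
	if m == nil {
		return h, false
	}
	// The regexp guarantees well-formed digits; errors can only come
	// from overflow, which is ignored in this sketch.
	h.PC, _ = strconv.ParseUint(m[1], 16, 64)
	h.M, _ = strconv.Atoi(m[2])
	h.Sigcode, _ = strconv.Atoi(m[3])
	h.Addr, _ = strconv.ParseUint(m[4], 16, 64)
	return h, true
}

// looksLikeNilDeref applies the heuristic described above (an assumption):
// SEGV_MAPERR (sigcode 1) with a fault address inside the first 4 KiB.
func looksLikeNilDeref(h crashHeader) bool {
	return h.Sigcode == 1 && h.Addr < 4096
}

func main() {
	log := "SIGSEGV: segmentation violation\nPC=0x54520 m=12 sigcode=1 addr=0xc0\n"
	h, ok := parseCrashHeader(log)
	if !ok {
		fmt.Println("no crash header found")
		return
	}
	fmt.Printf("pc=%#x m=%d sigcode=%d addr=%#x nil-deref-like=%v\n",
		h.PC, h.M, h.Sigcode, h.Addr, looksLikeNilDeref(h))
}
```

The same fields (`sigcode=1 addr=0xc0`) are what the watchflakes rule at the top of this issue keys on.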

watchflakes

gopherbot commented 1 month ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "golang.org/x/tools/internal/imports" && test == "TestModReplace2"
2024-08-13 17:00 x_tools-go1.23-linux-riscv64 tools@c1241b9c release-branch.go1.23@6885bad7 x/tools/internal/imports.TestModReplace2 [ABORT] (log)

=== RUN   TestModReplace2
SIGSEGV: segmentation violation
PC=0x54520 m=12 sigcode=1 addr=0xc0

goroutine 0 gp=0x3f504421c0 m=12 mp=0x3f50526708 [idle]:
runtime.preemptone(0x3f504421c0?)
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:6297 +0x38 fp=0x3f5043ff28 sp=0x3f5043ff10 pc=0x54520
runtime.preemptall()
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:6275 +0x60 fp=0x3f5043ff50 sp=0x3f5043ff28 pc=0x544c0
runtime.forEachPInternal(0x2fa878)
...
a3  0x223abf02  a4  0x3f98bb8000
a5  0x31a7d99   a6  0x29ab75fd
a7  0x1187f0    s2  0x3f5043fed0
s3  0x3f50526708    s4  0x3f50474000
s5  0x3f50241500    s6  0xffffffff
s7  0x4 s8  0x3f50038688
s9  0x3f5043fdc8    s10 0x2fa878
s11 0x3f504421c0    t3  0x2eb2a46908caf
t4  0xffffffffffffffff  t5  0x1913e15049b3
t6  0x3f50038408    pc  0x54520

watchflakes

mknyszek commented 1 month ago

CC @golang/riscv64

mengzhuo commented 1 month ago

Update: I've closed the flake issues matching the query "addr=0xc0 riscv64". All of these failures are related to the same bad builder: linux-riscv64-mengzhuo--cm2

I found interesting logs in dmesg of this builder:

[12901.072321] INFO: task gc-stress:730941 blocked for more than 614 seconds.
[12901.079356]       Not tainted 6.6.36 #2.0~rc3.2+20240815152052
[12901.085355] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12901.100078] task:gc-stress       state:D stack:0     pid:730941 ppid:730855 flags:0x00000004
[12901.100120] Call Trace:
[12901.100126] [<ffffffff810c173a>] __schedule+0x28c/0x848
[12901.100153] [<ffffffff810c1d3e>] schedule+0x48/0xd2
[12901.100161] [<ffffffff810c20c0>] schedule_preempt_disabled+0x16/0x28
[12901.100170] [<ffffffff810c4f96>] rwsem_down_write_slowpath+0x220/0x58e
[12901.100183] [<ffffffff810c5374>] down_write+0x70/0x72
[12901.100191] [<ffffffff801a8ae8>] vma_expand+0x46/0x1ca
[12901.100202] [<ffffffff801ac086>] mmap_region+0x3c0/0x6b0
[12901.100212] [<ffffffff801ac596>] do_mmap+0x220/0x39e
[12901.100219] [<ffffffff8018649a>] vm_mmap_pgoff+0x8c/0x118
[12901.100232] [<ffffffff801a968a>] ksys_mmap_pgoff+0x3a/0x158
[12901.100240] [<ffffffff80005562>] __riscv_sys_mmap+0x2a/0x36
[12901.100250] [<ffffffff810bf224>] do_trap_ecall_u+0xba/0x12e
[12901.100259] [<ffffffff810c8722>] ret_from_exception+0x0/0x6e
[12901.100273] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

The stack trace log shows that all of these failures are related to the mmap call.

The bad builder is the 4 GB RAM version of the bananapi-f3, so I've updated sysctl with:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Note: This builder was still able to respond to prometheus-node-exporter, so I didn't get any warning :(
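
Since liveness checks did not catch this, one option is to scan the kernel log for hung-task reports directly. The following is a rough sketch under that assumption. The names `hungTask` and `findHungTasks` are hypothetical, and the pattern is derived only from the single dmesg line quoted above, so other kernel versions may word the message differently.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// hungTask describes one "blocked for more than N seconds" report.
type hungTask struct {
	Comm    string // task name, e.g. "gc-stress"
	PID     int
	Seconds int
}

// hungRE matches lines like:
//   [12901.072321] INFO: task gc-stress:730941 blocked for more than 614 seconds.
var hungRE = regexp.MustCompile(`INFO: task (.+):(\d+) blocked for more than (\d+) seconds`)

// findHungTasks returns every hung-task report found in dmesg output.
func findHungTasks(dmesg string) []hungTask {
	var out []hungTask
	sc := bufio.NewScanner(strings.NewReader(dmesg))
	for sc.Scan() {
		m := hungRE.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		// The regexp guarantees digits; overflow is ignored in this sketch.
		pid, _ := strconv.Atoi(m[2])
		secs, _ := strconv.Atoi(m[3])
		out = append(out, hungTask{Comm: m[1], PID: pid, Seconds: secs})
	}
	return out
}

func main() {
	sample := "[12901.072321] INFO: task gc-stress:730941 blocked for more than 614 seconds.\n" +
		"[12901.100120] Call Trace:\n"
	for _, t := range findHungTasks(sample) {
		fmt.Printf("hung task %s (pid %d) blocked for more than %d seconds\n", t.Comm, t.PID, t.Seconds)
	}
}
```

Note that the kernel suppresses further reports once kernel.hung_task_warnings is exhausted (see the last dmesg line above), so a scan like this can only see the first few occurrences after boot.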

mengzhuo commented 1 month ago

Update, 28 Aug: I've made contact with SpacemiT staff, who confirmed that the hardware litmus test works as expected after a two-day run.

I've also upgraded two builders with a backported kernel patch related to the mmap hang in the scheduler: https://lore.kernel.org/all/20231213203001.179237-5-alexghiti@rivosinc.com/

Unfortunately, it doesn't work.

For now, I've suspended these two builders and am waiting for a kernel fix from SpacemiT.