milkv-pioneer / issues


Analysis of a system crash when reading/writing the SSD with the graphics card in slot 1 #27

Open u0076 opened 1 year ago

u0076 commented 1 year ago

Problem description

Running Fedora 38 with the rootfs on an SSD, the system crashes when launching supertuxkart. The serial console reports the following log:

fedora-riscv login: milkv
Password:
[   66.903727] systemd-journald[861]: File /var/log/journal/ff9bc945b5324838ad7107720fd1ffb6/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
Last login: Thu Jun 15 21:59:04 from 192.168.2.230
[milkv@fedora-riscv ~]$
[milkv@fedora-riscv ~]$
[milkv@fedora-riscv ~]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk1     179:0    0 117.8G  0 disk
├─mmcblk1p1 179:1    0   122M  0 part /boot/efi
├─mmcblk1p2 179:2    0   488M  0 part /boot
└─mmcblk1p3 179:3    0  11.4G  0 part
zram0       251:0    0     8G  0 disk [SWAP]
nvme1n1     259:0    0 953.9G  0 disk
└─nvme1n1p1 259:1    0 953.9G  0 part /
nvme0n1     259:2    0 238.5G  0 disk
└─nvme0n1p1 259:3    0 238.5G  0 part
[milkv@fedora-riscv ~]$ [  102.972401] usb 1-1.1: new full-speed USB device number 4 using xhci_hcd
[  103.240918] usb 1-1.1: New USB device found, idVendor=25a7, idProduct=fa61, bcdDevice= 6.23
[  103.249314] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  103.256653] usb 1-1.1: Product: 2.4G Receiver
[  103.261007] usb 1-1.1: Manufacturer: Compx
[  103.428111] hid: raw HID events driver (C) Jiri Kosina
[  103.515965] usbcore: registered new interface driver usbhid
[  103.521555] usbhid: USB HID core driver
[  103.545430] usbcore: registered new interface driver usbmouse
[  103.548181] usbcore: registered new interface driver usbkbd
[  103.604197] input: Compx 2.4G Receiver as /devices/platform/4c00000000.pcie/pci0003:c0/0003:c0:00.0/0003:c1:00.0/0003:c2:04.0/0003:c4:00.0/usb1/1-1/1-1.1/1-1.1:1.0/0003:25A7:FA61.0001/input/input0
[  103.683406] hid-generic 0003:25A7:FA61.0001: input,hidraw0: USB HID v1.10 Keyboard [Compx 2.4G Receiver] on usb-0003:c4:00.0-1.1/input0
[  103.696498] input: Compx 2.4G Receiver Mouse as /devices/platform/4c00000000.pcie/pci0003:c0/0003:c0:00.0/0003:c1:00.0/0003:c2:04.0/0003:c4:00.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:25A7:FA61.0002/input/input1
[  103.715122] input: Compx 2.4G Receiver as /devices/platform/4c00000000.pcie/pci0003:c0/0003:c0:00.0/0003:c1:00.0/0003:c2:04.0/0003:c4:00.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:25A7:FA61.0002/input/input2
[  103.733000] input: Compx 2.4G Receiver Consumer Control as /devices/platform/4c00000000.pcie/pci0003:c0/0003:c0:00.0/0003:c1:00.0/0003:c2:04.0/0003:c4:00.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:25A7:FA61.0002/input/input3
[  103.812780] input: Compx 2.4G Receiver System Control as /devices/platform/4c00000000.pcie/pci0003:c0/0003:c0:00.0/0003:c1:00.0/0003:c2:04.0/0003:c4:00.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:25A7:FA61.0002/input/input4
[  103.832296] hid-generic 0003:25A7:FA61.0002: input,hiddev0,hidraw1: USB HID v1.10 Mouse [Compx 2.4G Receiver] on usb-0003:c4:00.0-1.1/input1
[  123.425406] EXT4-fs (mmcblk1p3): mounted filesystem with ordered data mode. Quota mode: none.
[  176.252418] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:27]
[  176.259659] Modules linked in: hid_generic usbkbd usbmouse usbhid hid amdgpu nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct ip_set gpu_sched drm_buddy qrtr radeon sunrpc snd_hda_intel snd_intel_dspcfg drm_display_helper snd_hda_codec drm_ttm_helper nls_iso8859_1 snd_hda_core ttm snd_pcm drm_kms_helper snd_timer snd syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops i2c_algo_bit soundcore uio_pdrv_genirq uio sch_fq_codel drm fuse zram autofs4 dm_multipath(E)
[  176.307420] CPU: 2 PID: 27 Comm: kworker/2:0 Tainted: G            E      6.1.22 #2
[  176.315072] Hardware name: Sophgo Mango (DT)
[  176.319339] Workqueue: rcu_gp wait_rcu_exp_gp
[  176.323709] epc : smp_call_function_single+0xbc/0x12a
[  176.328764]  ra : __sync_rcu_exp_select_node_cpus+0x27e/0x414
[  176.334507] epc : ffffffff800ba4aa ra : ffffffff8008db72 sp : ffffffc80a583c80
[  176.341721]  gp : ffffffff81cd4790 tp : ffffffd8ffe03d80 t0 : 0000000000008000
[  176.348934]  t1 : 0000000000f00000 t2 : 0000ff0000000000 s0 : ffffffc80a583cd0
[  176.356148]  s1 : ffffffc80a583c80 a0 : 0000000000000039 a1 : fffffff6df74a100
[  176.363361]  a2 : 0000000000000000 a3 : 0000000000000000 a4 : fffffff75e6ad000
[  176.370573]  a5 : 0000000000000001 a6 : ffffffff8008d864 a7 : 0000000000000000
[  176.377786]  s2 : ffffffff81b0f540 s3 : ffffffff8109cf00 s4 : 0000000000003208
[  176.384999]  s5 : ffffffff81099780 s6 : 0000000000000200 s7 : ffffffff8008d864
[  176.392213]  s8 : ffffffff81b0f730 s9 : 000000000000cdf7 s10: ffffffff81d5b588
[  176.399425]  s11: fffffff6dfd4df00 t3 : 0000000000000040 t4 : 000000000000ffff
[  176.406638]  t5 : 00000000ffffffff t6 : 0000000000008000
[  176.411942] status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005
[  176.419851] [<ffffffff800ba4aa>] smp_call_function_single+0xbc/0x12a
[  176.426202] [<ffffffff8008db72>] __sync_rcu_exp_select_node_cpus+0x27e/0x414
[  176.433245] [<ffffffff8008de9c>] sync_rcu_exp_select_cpus+0x172/0x2e2
[  176.439681] [<ffffffff80091426>] wait_rcu_exp_gp+0x1e/0x32
[  176.445162] [<ffffffff80032cc4>] process_one_work+0x1c2/0x396
[  176.450904] [<ffffffff80032fc2>] worker_thread+0x12a/0x408
[  176.456385] [<ffffffff8003af28>] kthread+0xbc/0xd2
[  176.461174] [<ffffffff80003b18>] ret_from_exception+0x0/0x16
[  183.622433] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 42s!
[  183.630383] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 42s!
[  183.638324] BUG: workqueue lockup - pool cpus=7 node=0 flags=0x0 nice=0 stuck for 42s!
[  183.646266] BUG: workqueue lockup - pool cpus=28 node=1 flags=0x0 nice=0 stuck for 31s!
[  183.654284] BUG: workqueue lockup - pool cpus=31 node=1 flags=0x0 nice=0 stuck for 42s!
[  183.662302] BUG: workqueue lockup - pool cpus=60 node=3 flags=0x0 nice=0 stuck for 42s!
[  183.670318] BUG: workqueue lockup - pool cpus=61 node=3 flags=0x0 nice=0 stuck for 43s!
[  183.678340] Showing busy workqueues and worker pools:
[  183.683399] workqueue events: flags=0x0
[  183.687232]   pwq 122: cpus=61 node=3 flags=0x0 nice=0 active=4/256 refcnt=5
[  183.687244]     in-flight: 521:bpf_prog_free_deferred
[  183.687262]     pending: bpf_prog_free_deferred, bpf_prog_free_deferred, bpf_prog_free_deferred
[  183.687292]   pwq 58: cpus=29 node=1 flags=0x0 nice=0 active=5/256 refcnt=6
[  183.687302]     in-flight: 163:bpf_prog_free_deferred, 3566:output_poll_execute [drm_kms_helper], 3564:bpf_prog_free_deferred, 1376:bpf_prog_free_deferred, 1194:bpf_prog_free_deferred
[  183.687828]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[  183.687841]     pending: psi_avgs_work, psi_avgs_work, kfree_rcu_monitor, psi_avgs_work
[  183.687915] workqueue events_power_efficient: flags=0x80
[  183.752128]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.752139]     in-flight: 387:radeon_fence_check_lockup [radeon]
[  183.771771] workqueue rcu_gp: flags=0x8
[  183.775643]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.775655]     in-flight: 27:wait_rcu_exp_gp
[  183.775687] workqueue rcu_par_gp: flags=0x8
[  183.791026]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[  183.791035]     pending: sync_rcu_exp_select_node_cpus, sync_rcu_exp_select_node_cpus, sync_rcu_exp_select_node_cpus
[  183.791065] workqueue mm_percpu_wq: flags=0x8
[  183.812731]   pwq 120: cpus=60 node=3 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812742]     pending: vmstat_update
[  183.812762]   pwq 62: cpus=31 node=1 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812771]     pending: vmstat_update
[  183.812777]   pwq 56: cpus=28 node=1 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812785]     pending: vmstat_update
[  183.812799]   pwq 14: cpus=7 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812808]     pending: vmstat_update
[  183.812813]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812821]     pending: vmstat_update
[  183.812826]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812835]     pending: vmstat_update
[  183.812839]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  183.812847]     pending: vmstat_update
[  183.813306] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=30s workers=3 idle: 424 415
[  183.813324] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=42s workers=5 idle: 357 47 459 460
[  183.813359] pool 58: cpus=29 node=1 flags=0x0 nice=0 hung=36s workers=6 idle: 3565
[  183.813397] pool 122: cpus=61 node=3 flags=0x0 nice=0 hung=43s workers=2 idle: 325

Test results

To describe the tests clearly, the slot positions are defined as follows: facing the Ethernet ports, the three PCIe slots from left to right are PCIe1, PCIe2, and PCIe3; of the two M.2 slots, the one nearer to you is M.2-1 and the farther one is M.2-2.

The graphics card is an R5 230, and the SSDs are Kingston. The boards tested are numbered 10, 11, 12, and 13.

We also found that with the graphics card in PCIe1, copying data between the two SSDs with the dd command crashes the system as well, with the same error messages as above.
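For reference, the dd copy between the two SSDs can be sketched as below. The paths are illustrative stand-ins (the report does not give the actual mount points; on the board they would be directories on nvme0n1p1 and nvme1n1p1), shown here with temporary directories so the commands are safe to run anywhere:

```shell
# Illustrative sketch of the SSD-to-SSD copy that triggered the lockup.
# SRC/DST are stand-ins for mount points on the two NVMe SSDs.
SRC=$(mktemp -d)
DST=$(mktemp -d)

# Write a 64 MiB test file on the "source SSD", flushing to disk.
dd if=/dev/zero of="$SRC/testfile" bs=1M count=64 conv=fsync

# Copy it to the "destination SSD" with dd, as in the report.
dd if="$SRC/testfile" of="$DST/testfile" bs=1M conv=fsync

# Verify the copy arrived intact.
cmp "$SRC/testfile" "$DST/testfile" && echo "copy OK"
```

On the affected boards, sustained I/O like this with the graphics card in PCIe1 reportedly reproduces the soft lockup; with the card in another slot it does not.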

lbfs commented 7 months ago

Were you able to resolve this issue?

I see this exact same problem at boot-up... f88ea7effafaead61388c15c6da87327fc61a625