Open MingcongBai opened 11 months ago
I tried the same operation under Loongnix. On a 3C5000 machine running Loongnix 20.5, updated to the latest version:
qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd
The QEMU window that opens stops at "guest has not initialized the display yet", and the terminal prints:
/sys/devices/system/cpu/cpu0/cpufreq/ cpuinfo_max_freq not exist!
Try /proc/cpuinfo...
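However, it hangs at the same point even without -accel kvm.
The problem above was caused by using the new-world EFI image. After switching to the OVMF image provided by Loongnix, everything works. It looks like the problem reported above may be specific to new-world systems.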
Internal feedback is that openEuler works fine on the Loongson 3C5000. Please try Yongbao as the host system: https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/
I tested with Yongbao 20231201 as the host system. The kernel still shows the same symptoms as on the other distributions:
[ 2023-12-12T20:50:45+08:00 ] [ 269.598480] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:50:45+08:00 ] [ 269.604563] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=2626
[ 2023-12-12T20:50:45+08:00 ] [ 269.613474] rcu: (detected by 1, t=5255 jiffies, g=5917, q=141 ncpus=16)
[ 2023-12-12T20:50:55+08:00 ] [ 279.620509] rcu: rcu_preempt kthread starved for 2494 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=5
[ 2023-12-12T20:50:55+08:00 ] [ 279.630711] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:50:55+08:00 ] [ 279.639788] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:50:55+08:00 ] [ 279.644951] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:51:58+08:00 ] [ 342.669638] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:51:58+08:00 ] [ 342.675698] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=10509
[ 2023-12-12T20:51:58+08:00 ] [ 342.684692] rcu: (detected by 1, t=23520 jiffies, g=5917, q=479 ncpus=16)
[ 2023-12-12T20:52:08+08:00 ] [ 352.691810] rcu: rcu_preempt kthread timer wakeup didn't happen for 2502 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 2023-12-12T20:52:08+08:00 ] [ 352.703047] rcu: Possible timer handling issue on cpu=1 timer-softirq=3498
[ 2023-12-12T20:52:08+08:00 ] [ 352.709963] rcu: rcu_preempt kthread starved for 2508 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 2023-12-12T20:52:08+08:00 ] [ 352.720250] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:52:08+08:00 ] [ 352.729326] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:52:08+08:00 ] [ 352.734465] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:53:11+08:00 ] [ 415.752308] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:53:11+08:00 ] [ 415.758369] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=17582
[ 2023-12-12T20:53:11+08:00 ] [ 415.767364] rcu: (detected by 1, t=41791 jiffies, g=5917, q=1049 ncpus=16)
[ 2023-12-12T20:53:21+08:00 ] [ 425.774569] rcu: rcu_preempt kthread starved for 2495 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:53:21+08:00 ] [ 425.784857] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:53:21+08:00 ] [ 425.793933] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:53:21+08:00 ] [ 425.799075] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:54:24+08:00 ] [ 488.814843] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:54:24+08:00 ] [ 488.820905] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=23006
[ 2023-12-12T20:54:24+08:00 ] [ 488.829901] rcu: (detected by 11, t=60060 jiffies, g=5917, q=1232 ncpus=16)
[ 2023-12-12T20:54:34+08:00 ] [ 498.837194] rcu: rcu_preempt kthread starved for 2498 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:54:34+08:00 ] [ 498.847481] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:54:34+08:00 ] [ 498.856557] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:54:34+08:00 ] [ 498.861699] rcu: Stack dump where RCU GP kthread last ran:
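The stall reports above repeat the same few fields, so a short script can make the pattern easier to see. The following is only a sketch: the function name summarize_rcu_stalls and the two regular expressions are assumptions written against the message layout shown in this log, not against any kernel-provided format guarantee.

```python
import re

# Per-CPU stall line, e.g. "rcu: 3-...0: (1 GPs behind) idle=..."
STALLED_CPU = re.compile(r"rcu:\s+(\d+)-\S+: \(")
# Summary line, e.g. "rcu: (detected by 1, t=5255 jiffies, g=5917, ..."
DETECTED = re.compile(r"\(detected by (\d+), t=(\d+) jiffies, g=(-?\d+)")


def summarize_rcu_stalls(log_text):
    """Collect stalled CPU numbers and (detector, t, g) tuples from a log."""
    stalled = set()
    reports = []
    for line in log_text.splitlines():
        match = STALLED_CPU.search(line)
        if match:
            stalled.add(int(match.group(1)))
            continue
        match = DETECTED.search(line)
        if match:
            detector, jiffies, grace_period = (int(g) for g in match.groups())
            reports.append((detector, jiffies, grace_period))
    return {"stalled_cpus": sorted(stalled), "reports": reports}


# Lines copied from the first two reports in the log above.
SAMPLE_LOG = """\
[ 269.604563] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=2626
[ 269.613474] rcu: (detected by 1, t=5255 jiffies, g=5917, q=141 ncpus=16)
[ 279.620509] rcu: rcu_preempt kthread starved for 2494 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=5
[ 342.675698] rcu: 3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=10509
[ 342.684692] rcu: (detected by 1, t=23520 jiffies, g=5917, q=479 ncpus=16)
"""

if __name__ == "__main__":
    summary = summarize_rcu_stalls(SAMPLE_LOG)
    print("stalled CPUs:", summary["stalled_cpus"])
    for detector, jiffies, grace_period in summary["reports"]:
        print(f"detected by CPU {detector}: t={jiffies} jiffies, g={grace_period}")
```

Applied to the log above, this shows that every report names CPU 3 and the same grace period (g=5917) while t keeps growing, which reads as one stall that never clears rather than several independent ones.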
Hardware: Loongson 3C5000 + 7A2000, 128 GB of RAM
Kernel version:
[root@Sunhaiyong ~]# uname -a
Linux Sunhaiyong 6.7.0-rc1 #1 SMP PREEMPT Thu Nov 30 02:07:13 UTC 2023 loongarch64 GNU/Linux
Building QEMU on Yongbao 20231201 ran into a toolchain-related problem:
collect2 版本 14.0.0 20231117 (experimental)
/usr/bin/ld -plugin /usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/lto-wrapper -plugin-opt=-fresolution=/tmp/ccbJ3P9u.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr --hash-style=gnu -m elf64loongarch -dynamic-linker /lib64/ld-linux-loongarch-lp64d.so.1 /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crt1.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crti.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtbegin.o -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0 -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../.. -L/lib64 -L/usr/lib64 --version -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtend.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crtn.o
-----------
Sanity testing C compiler: cc
Is cross compiler: False.
Sanity check compiler command line: cc sanitycheckc.c -o sanitycheckc.exe -D_FILE_OFFSET_BITS=64
Sanity check compile stdout:
-----
Sanity check compile stderr:
/usr/bin/ld: 找不到 /usr/lib64/libc_nonshared.a: 没有那个文件或目录
collect2: 错误:ld 返回 1
-----
../meson.build:1:0: ERROR: Compiler cc cannot compile programs.
[root@Sunhaiyong qemu]#
Is this RCU error reported by the physical machine's kernel or by the virtual machine's kernel?
What command did you use to build QEMU? On my side, building upstream QEMU on openEuler works.
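@bibo-mao The errors above were reported by the host kernel. As for the build problem, we later learned from @sunhaiyong1978 that Yongbao needs its development-related components enabled before it can compile anything. @liushuyu will continue testing tomorrow.
Following a hint received by @chenhuacai, we updated the not-yet-merged KVM LSX/LASX patches and applied them, together with the patches from the loongarch-next branch, to a 6.7.0-rc5 kernel. The symptoms described in the original post are unchanged.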
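Is there a machine we can log in to remotely so that we can look into the cause? On our side we have tested dual-socket 3C5000, single-socket 3C5000, and single-socket 3A6000 systems and have not seen the host report any RCU problem. Only the guest reports RCU timeouts, and only when a stress test is running inside it.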
We have been in touch and provided access.
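After investigating, we found that part of this report was a false alarm (I have marked the mistaken parts with strikethrough):
- The guest kernel must be given console=ttyS0,115200, otherwise there is no output at all (the earlier reset was in fact caused by the kernel failing to find the disk and hitting a kernel panic)
- The problem of the 3C5000 host hanging when -bios is not specified still stands
- The -vga parameter cannot be used, but if -device virtio-gpu-pci is specified, the serial parameter above is not needed
- @bibo-mao Systems built with LSX optimizations all hit SIGILL errors, but that belongs to a separate report; see #24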
Problem description
On a 3C5000, starting QEMU with KVM acceleration using the following command freezes the host's graphical interface (SSH remains usable):
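At this point the kernel intermittently prints errors such as workqueue lockup, watchdog: BUG: soft lockup - CPU#8 stuck for 33s! [QSGRenderThread:1532], or even watchdog: Watchdog detected hard LOCKUP on cpu 15; two examples are shown in the attached screenshots.
If you download QEMU-EFI-8.1.fd from https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/ and pass it with the -bios parameter, everything works and the guest boots to the EFI Shell.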
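However, that is not the end of it. If you then download the minimal image from the link above and boot from it:
qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd -hda https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/archlinux-minimal-2023.05.10-loong64.qcow2
QEMU gets as far as GRUB, but after you press Enter to boot the system the guest terminal prints only a few lines and then, after a while, resets and reboots:
MemoryMapPteRange 507 Address DCE0000 End DD20000 Attributes 53
SetUefiImageMemoryAttributes - 0x000000000DC40000 - 0x0000000000040000 (0x0000000000000000)
This part of the problem was caused by not passing the console=ttyS0,115200 kernel parameter in the guest (the colleague who ran the earlier test did not mention this), so it was a false alarm. The problem that omitting the -bios parameter causes a host kernel failure still exists. If the -device virtio-gpu-pci parameter is specified, the extra serial parameter is not needed.
Debugging attempts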
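We have tried the following, none of which mitigates the problem (the symptoms are identical):
- Limiting the number of cores to 4 with the nr_cpus=4 kernel parameter
- Booting the system with nr_cpus=4
Runtime environment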
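Notes
With the same test environment, the problem cannot be reproduced on either the 3A5000 or the 3A6000 platform: the host system does not hang whether or not the -bios parameter is specified.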
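To make the tested variants easier to compare side by side, the QEMU invocations discussed in this report can be sketched as a small helper. This is only an illustration: build_qemu_argv and the VARIANTS table are hypothetical names, the flags are limited to those that appear in this thread, and the disk argument assumes the minimal image has been downloaded to a local file rather than passed as a URL.

```python
import shlex

QEMU_BINARY = "qemu-system-loongarch64"


def build_qemu_argv(kvm=True, bios=None, disk=None, virtio_gpu=False):
    """Assemble a QEMU command line from the options used in this thread."""
    argv = [QEMU_BINARY]
    if kvm:
        argv += ["-accel", "kvm"]
    if bios is not None:
        argv += ["-bios", bios]
    if disk is not None:
        argv += ["-hda", disk]
    if virtio_gpu:
        argv += ["-device", "virtio-gpu-pci"]
    return argv


# Assumed local file names for the firmware and the downloaded image.
FIRMWARE = "QEMU-EFI-8.1.fd"
IMAGE = "archlinux-minimal-2023.05.10-loong64.qcow2"

# Labels restate what the thread reports for each variant.
VARIANTS = {
    "no -bios (host lockups reported on 3C5000)": build_qemu_argv(),
    "-bios only (boots to EFI Shell)": build_qemu_argv(bios=FIRMWARE),
    "-bios + disk (guest needs console=ttyS0,115200)": build_qemu_argv(
        bios=FIRMWARE, disk=IMAGE
    ),
    "-bios + disk + virtio-gpu (no serial parameter needed)": build_qemu_argv(
        bios=FIRMWARE, disk=IMAGE, virtio_gpu=True
    ),
}

if __name__ == "__main__":
    for label, argv in VARIANTS.items():
        print(f"{label}:\n  {shlex.join(argv)}")
```

Keeping the variants in one table makes it clear that the only difference between the failing case and the working ones is the presence of -bios.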