loongson-community / discussions

Cross-community issue tracker & discussions

QEMU (KVM) does not work properly on 3C5000 #25

Open MingcongBai opened 11 months ago

MingcongBai commented 11 months ago

Problem description

Starting QEMU with KVM acceleration on a 3C5000 using the following command freezes the host's graphical interface (SSH remains usable):

qemu-system-loongarch64 -accel kvm

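At this point the host kernel intermittently prints errors such as workqueue lockup, watchdog: BUG: soft lockup - CPU#8 stuck for 33s! [QSGRenderThread:1532], and even watchdog: Watchdog detected hard LOCKUP on cpu 15. Two examples are shown in the attached screenshots:

(two screenshots attached in the original issue)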

If QEMU-EFI-8.1.fd is downloaded from https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/ and passed via the -bios option:

qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd

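then everything works and QEMU boots to the EFI Shell.

The problem does not end there, however. If the minimal image from the link above is then downloaded and used as the boot disk: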

qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd -hda https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/archlinux-minimal-2023.05.10-loong64.qcow2

QEMU gets as far as GRUB, but after pressing Enter to boot the system, the guest terminal prints only a few lines and then resets after a while:

MemoryMapPteRange 507 Address DCE0000 End DD20000 Attributes 53
SetUefiImageMemoryAttributes - 0x000000000DC40000 - 0x0000000000040000 (0x0000000000000000)

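This part of the problem was caused by not passing the console=ttyS0,115200 kernel parameter to the guest (the colleague who tested earlier did not mention this), so it was a false alarm. The host kernel failure when -bios is not specified still stands. If -device virtio-gpu-pci is specified, the serial console parameter is not needed.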

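Debugging steps

We have tried the following; none of it mitigated the problem (the symptoms are identical):

Environment

Notes

With the same test environment, the problem cannot be reproduced on either the 3A5000 or the 3A6000 platform: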

cthbleachbit commented 11 months ago

I tried the same steps under loongnix, on a 3C5000 running loongnix 20.5 updated to the latest version:

qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd

The QEMU window that opens stays at "guest has not initialized the display yet", and the terminal prints:

/sys/devices/system/cpu/cpu0/cpufreq/     cpuinfo_max_freq not exist!
Try /proc/cpuinfo...

However, it gets stuck at the same point even without -accel kvm.

MingcongBai commented 11 months ago

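The problem above was caused by using a new-world EFI image. With the OVMF image provided by Loongnix, everything works. It looks like the problem reported above may be specific to new-world systems.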

chenhuacai commented 11 months ago

Internal feedback is that openEuler works fine on the Loongson 3C5000; please try Yongbao as the host system. https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/

liushuyu commented 11 months ago

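> Internal feedback is that openEuler works fine on the Loongson 3C5000; please try Yongbao as the host system. https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/

I tested with Yongbao 20231201 as the host system; the kernel still shows the same symptoms as on the other distributions: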

[ 2023-12-12T20:50:45+08:00 ] [  269.598480] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:50:45+08:00 ] [  269.604563] rcu:   3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=2626
[ 2023-12-12T20:50:45+08:00 ] [  269.613474] rcu:   (detected by 1, t=5255 jiffies, g=5917, q=141 ncpus=16)
[ 2023-12-12T20:50:55+08:00 ] [  279.620509] rcu: rcu_preempt kthread starved for 2494 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=5
[ 2023-12-12T20:50:55+08:00 ] [  279.630711] rcu:   Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:50:55+08:00 ] [  279.639788] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:50:55+08:00 ] [  279.644951] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:51:58+08:00 ] [  342.669638] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:51:58+08:00 ] [  342.675698] rcu:   3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=10509
[ 2023-12-12T20:51:58+08:00 ] [  342.684692] rcu:   (detected by 1, t=23520 jiffies, g=5917, q=479 ncpus=16)
[ 2023-12-12T20:52:08+08:00 ] [  352.691810] rcu: rcu_preempt kthread timer wakeup didn't happen for 2502 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 2023-12-12T20:52:08+08:00 ] [  352.703047] rcu:   Possible timer handling issue on cpu=1 timer-softirq=3498
[ 2023-12-12T20:52:08+08:00 ] [  352.709963] rcu: rcu_preempt kthread starved for 2508 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 2023-12-12T20:52:08+08:00 ] [  352.720250] rcu:   Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:52:08+08:00 ] [  352.729326] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:52:08+08:00 ] [  352.734465] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:53:11+08:00 ] [  415.752308] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:53:11+08:00 ] [  415.758369] rcu:   3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=17582
[ 2023-12-12T20:53:11+08:00 ] [  415.767364] rcu:   (detected by 1, t=41791 jiffies, g=5917, q=1049 ncpus=16)
[ 2023-12-12T20:53:21+08:00 ] [  425.774569] rcu: rcu_preempt kthread starved for 2495 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:53:21+08:00 ] [  425.784857] rcu:   Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:53:21+08:00 ] [  425.793933] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:53:21+08:00 ] [  425.799075] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:54:24+08:00 ] [  488.814843] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:54:24+08:00 ] [  488.820905] rcu:   3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=23006
[ 2023-12-12T20:54:24+08:00 ] [  488.829901] rcu:   (detected by 11, t=60060 jiffies, g=5917, q=1232 ncpus=16)
[ 2023-12-12T20:54:34+08:00 ] [  498.837194] rcu: rcu_preempt kthread starved for 2498 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:54:34+08:00 ] [  498.847481] rcu:   Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:54:34+08:00 ] [  498.856557] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:54:34+08:00 ] [  498.861699] rcu: Stack dump where RCU GP kthread last ran:
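
Stall reports like the ones above are easier to compare once each is reduced to the stalled CPU, the detecting CPU, the stall duration in jiffies and the grace-period number. The following is a minimal sketch, assuming the log text arrives on standard input in the format shown above; the function name summarize_rcu_stalls is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: condense RCU stall reports read from stdin.
# Prints one line per "(detected by ...)" record, paired with the
# most recent per-CPU stall line ("rcu:   3-...0: ...") seen before it.
summarize_rcu_stalls() {
    awk '
        # Per-CPU stall line: remember the CPU number that is stalled.
        /rcu: +[0-9]+-\.\.\./ {
            match($0, /rcu: +[0-9]+-/)
            cpu = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", cpu)
        }
        # Detection line: extract detecting CPU, jiffies and grace period.
        /\(detected by [0-9]+, t=[0-9]+ jiffies, g=[0-9]+/ {
            match($0, /detected by [0-9]+/)
            by = substr($0, RSTART + 12, RLENGTH - 12)
            match($0, /t=[0-9]+/)
            t = substr($0, RSTART + 2, RLENGTH - 2)
            match($0, /g=[0-9]+/)
            g = substr($0, RSTART + 2, RLENGTH - 2)
            printf "stalled_cpu=%s detected_by=%s t=%s g=%s\n", cpu, by, t, g
        }
    '
}

# Usage (assumes the host log is available via dmesg):
#   dmesg | summarize_rcu_stalls
```

Applied to the log above, every report names CPU 3 and grace period g=5917, while t grows from 5255 to 60060 jiffies, which suggests a single stall that never clears rather than four separate ones.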

Hardware: Loongson 3C5000 + 7A2000, 128 GB RAM

Kernel version:

[root@Sunhaiyong ~]# uname -a
Linux Sunhaiyong 6.7.0-rc1 #1 SMP PREEMPT Thu Nov 30 02:07:13 UTC 2023 loongarch64 GNU/Linux

An attempt to build QEMU on Yongbao 20231201 ran into a toolchain-related problem:

collect2 version 14.0.0 20231117 (experimental)
/usr/bin/ld -plugin /usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/lto-wrapper -plugin-opt=-fresolution=/tmp/ccbJ3P9u.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr --hash-style=gnu -m elf64loongarch -dynamic-linker /lib64/ld-linux-loongarch-lp64d.so.1 /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crt1.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crti.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtbegin.o -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0 -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../.. -L/lib64 -L/usr/lib64 --version -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtend.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crtn.o
-----------
Sanity testing C compiler: cc
Is cross compiler: False.
Sanity check compiler command line: cc sanitycheckc.c -o sanitycheckc.exe -D_FILE_OFFSET_BITS=64
Sanity check compile stdout:

-----
Sanity check compile stderr:
/usr/bin/ld: cannot find /usr/lib64/libc_nonshared.a: No such file or directory
collect2: error: ld returned 1 exit status

-----

../meson.build:1:0: ERROR: Compiler cc cannot compile programs.
[root@Sunhaiyong qemu]#

bibo-mao commented 11 months ago

> I tested with Yongbao 20231201 as the host system; the kernel still shows the same symptoms as on the other distributions:

Is this RCU error reported by the host (physical machine) kernel or by the guest kernel?

> An attempt to build QEMU on Yongbao 20231201 ran into a toolchain-related problem:

What command did you use to build QEMU? On my side, building upstream QEMU on openEuler works.

MingcongBai commented 11 months ago

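@bibo-mao The errors above come from the host kernel. As for the build problem, we later learned from @sunhaiyong1978 that Yongbao needs its development components enabled before it can compile anything; @liushuyu will continue testing tomorrow.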

MingcongBai commented 11 months ago

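Following a hint received by @chenhuacai, we updated the not-yet-merged KVM LSX/LASX patches and applied them, together with the loongarch-next branch patches, on top of the 6.7.0-rc5 kernel. The symptoms from the original post are unchanged.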

bibo-mao commented 11 months ago

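Is there a machine we can log in to remotely, so that we can look into the cause? On our side we have tested dual-socket 3C5000, single-socket 3C5000 and single-socket 3A6000 and did not see the host report RCU problems; only the guest reported RCU timeouts when running stress tests inside it.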

MingcongBai commented 11 months ago

> Is there a machine we can log in to remotely, so that we can look into the cause?

We have been in touch and provided access.

MingcongBai commented 11 months ago

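After investigation, we found that part of this report was a false alarm (I have marked the false-alarm parts with strikethrough):

  1. When QEMU boots the guest, console=ttyS0,115200 must be specified, otherwise there is no output at all (the earlier reset actually happened because the guest kernel could not find the disk and panicked)
  2. The problem of the 3C5000 host locking up when -bios is not specified still stands
  3. The -vga option appears to be unusable, but if -device virtio-gpu-pci is specified, the serial console parameter above is not needed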
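
The two working setups from the list above can be written down as command lines. This is only a sketch under assumptions: BIOS and IMAGE are placeholder local paths for the files downloaded earlier in the thread, -serial stdio is an addition not taken from the thread (a standard QEMU option that routes the guest serial port to the terminal), and the function qemu_cmd is a made-up helper that prints the command instead of running it.

```shell
#!/bin/sh
# Placeholder paths (assumption): wherever the firmware and image were saved.
BIOS="QEMU-EFI-8.1.fd"
IMAGE="archlinux-minimal-2023.05.10-loong64.qcow2"

# Hypothetical helper: print (do not run) a QEMU command line.
# $1 selects how guest output is made visible: "serial" or "virtio-gpu".
qemu_cmd() {
    base="qemu-system-loongarch64 -accel kvm -bios $BIOS -hda $IMAGE"
    case "$1" in
        serial)
            # The guest kernel must also be booted with console=ttyS0,115200,
            # for example by editing the entry in GRUB before booting.
            echo "$base -serial stdio" ;;
        virtio-gpu)
            # With a virtio GPU the serial console parameter is not needed.
            echo "$base -device virtio-gpu-pci" ;;
        *)
            echo "usage: qemu_cmd serial|virtio-gpu" >&2
            return 1 ;;
    esac
}

qemu_cmd serial
qemu_cmd virtio-gpu
```

Both variants keep -bios, since omitting it is what triggers the host lockup on the 3C5000.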

@bibo-mao

MingcongBai commented 11 months ago

Systems built with LSX optimizations all hit SIGILL errors, but that belongs to a separate report; see #24.