What's confusing to me is this:

[Thu Jun 20 14:13:53 2024] dpdk-vhost-evt[742863]: segfault at b9 ip 00000000004d0c12 sp 00007f2b31bfe708 error 4 in vhost[407000+28c000] likely on CPU 6 (core 6, socket 0)

It says that the `dpdk-vhost-evt` thread got hit while spinning on CPU 6. This is weird, since the `vhost` app is told to execute under CPU 0 only. Is this somehow ignored? Or is the kernel simply mistaken? I haven't tracked these threads to see what CPU affinity they are actually running with, but still, it's quite surprising.
DPDK processes vhost events on a separate pthread - so this is expected for that `dpdk-vhost-evt` thread.
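For what it's worth, a quick way to double-check what each thread is actually bound to (assuming the target really runs as a binary named `vhost`, as the segfault line suggests):

```
# Show the CPU each thread of the vhost process last ran on (PSR column)
ps -T -o tid,comm,psr -p "$(pidof vhost)"

# Show the affinity mask each thread is actually allowed to run with
for tid in $(ps -T -o tid= -p "$(pidof vhost)"); do
    taskset -cp "$tid"
done
```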
Some additional info regarding the state of the VM:

When `vhost` crashes, the `vhost-user-blk-pci` part on the QEMU side does attempt to close all the chardevs that were previously connected. However, even though they are reported as "disconnected", they are still busy and cannot be removed:
char_Nvme0n1p0: filename=disconnected:unix:/root/vhost_test/vhost/0/naa.Nvme0n1p0.0
char_Nvme0n1p1: filename=disconnected:unix:/root/vhost_test/vhost/0/naa.Nvme0n1p1.0
seabios: filename=file
compat_monitor0: filename=telnet:127.0.0.1:10002,server=on <-> 127.0.0.1:38708
serial0: filename=pty:/dev/pts/2
(qemu) char
chardev-add chardev-change chardev-remove chardev-send-break
(qemu) chardev-remove
char_Nvme0n1p0 char_Nvme0n1p1 compat_monitor0 parallel0
seabios serial0
(qemu) chardev-remove ch
char_Nvme0n1p0 char_Nvme0n1p1
(qemu) chardev-remove char_Nvme0n1p0
Error: Chardev 'char_Nvme0n1p0' is busy
(qemu) chardev-remove char_Nvme0n1p1
Error: Chardev 'char_Nvme0n1p1' is busy
(qemu)
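For completeness, the same can be seen/attempted over QMP as well. A minimal sketch - `qmp-shell` ships with the QEMU sources, and the socket path below is just an example of a `-qmp` endpoint, not what this setup actually uses:

```
$ qmp-shell /tmp/qmp.sock
(QEMU) query-chardev
(QEMU) chardev-remove id=char_Nvme0n1p0
```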
This, I guess, is sort of expected, as at this point the kernel inside the guest VM still sees the related block devices. However, an attempt to perform any sort of operation against these devices leaves the given process stuck inside the kernel - as mentioned in the initial report, the system inside the VM is essentially dead. There's a chance to somewhat operate inside the VM when the boot sequence is kept very minimal (e.g. by dropping the initrd). In such a case, it's possible to observe where the kernel is getting stuck. E.g.:
[root@vhostfedora-cloud-23052 ~]# echo w >/proc/sysrq-trigger
[root@vhostfedora-cloud-23052 ~]# for p in \
> $(dmesg | sed -n 's/.*[^p]pid:\([0-9]\+\).*/\1/p'); do
> echo "$p:"; cat "/proc/$p/stack"; done
254:
[<0>] folio_wait_bit_common+0x13d/0x350
[<0>] do_read_cache_folio+0x12a/0x190
[<0>] read_part_sector+0x36/0xb0
[<0>] sgi_partition+0x3f/0x350
[<0>] bdev_disk_changed+0x2aa/0x700
[<0>] blkdev_get_whole+0x7a/0x90
[<0>] blkdev_get_by_dev.part.0+0x174/0x320
[<0>] disk_scan_partitions+0x69/0xe0
[<0>] device_add_disk+0x3bb/0x3c0
[<0>] virtblk_probe+0x8a2/0xe40 [virtio_blk]
[<0>] virtio_dev_probe+0x1b0/0x270
[<0>] really_probe+0x19b/0x3e0
[<0>] __driver_probe_device+0x78/0x160
[<0>] driver_probe_device+0x1f/0x90
[<0>] __driver_attach+0xd2/0x1c0
[<0>] bus_for_each_dev+0x85/0xd0
[<0>] bus_add_driver+0x116/0x220
[<0>] driver_register+0x59/0x100
[<0>] virtio_blk_init+0x4e/0xff0 [virtio_blk]
[<0>] do_one_initcall+0x5a/0x320
[<0>] do_init_module+0x60/0x240
[<0>] __do_sys_init_module+0x17f/0x1b0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
497:
[<0>] blkdev_get_by_dev.part.0+0x134/0x320
[<0>] blkdev_open+0x47/0xb0
[<0>] do_dentry_open+0x200/0x500
[<0>] path_openat+0xafe/0x1160
[<0>] do_filp_open+0xb3/0x160
[<0>] do_sys_openat2+0xab/0xe0
[<0>] __x64_sys_openat+0x57/0xa0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
501:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
502:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@vhostfedora-cloud-23052 ~]#
[root@vhostfedora-cloud-23052 ~]# jobs
[1] Running ( read -rn1 < /dev/vda ) &
[2]- Running reboot -f &
[3]+ Running reboot -f &
[root@vhostfedora-cloud-23052 ~]# for p in $(jobs -p); do
> echo "$p:"; cat "/proc/$p/stack"; done
497:
[<0>] blkdev_get_by_dev.part.0+0x134/0x320
[<0>] blkdev_open+0x47/0xb0
[<0>] do_dentry_open+0x200/0x500
[<0>] path_openat+0xafe/0x1160
[<0>] do_filp_open+0xb3/0x160
[<0>] do_sys_openat2+0xab/0xe0
[<0>] __x64_sys_openat+0x57/0xa0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
501:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
502:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@vhostfedora-cloud-23052 ~]#
Even an attempt to reboot or shut down the kernel through sysrq blocks forever.
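(That is, the standard magic-sysrq triggers along the lines of:)

```
# 'b' = immediate reboot, 'o' = power off; in this state neither gets the VM down
echo b > /proc/sysrq-trigger
echo o > /proc/sysrq-trigger
```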
I am really not sure what's at fault here. Granted, `vhost` shouldn't crash at all, but the very same thing happens when `vhost` is simply terminated while the VM is still running. Is it QEMU's fault? Is the kernel inside the VM not kicked with some crucial updates regarding the state of the devices? Is it the kernel's fault for not handling something properly down the stack? Is it both? Testing with the latest QEMU and a newer kernel (6.9.4) yields exactly the same results when `vhost` suddenly disappears.
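In other words, the crash itself isn't even needed to get the guest into this state; doing something like the following on the host while the VM is up ends the same way (assuming the target runs as a plain `vhost` binary, as in the segfault line above):

```
# Abruptly terminate the vhost target while the VM is still running;
# the guest ends up in the same wedged state as after the segfault
kill -9 "$(pidof vhost)"
```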
Under CI, we have a separate timer-based job which executes the above test suite. Recently, we have been seeing a high rate of failures in the blk (virtio-pci) variant. In this particular case, it's caused by a `vhost` crash which usually happens at `blk_hotremove_tc2()`, when the VM is being rebooted right after the `HotInNvme0` ctrl gets removed. Essentially, the suite boils down to:
Crash under a plain SPDK build (no debug, no asan|ubsan, etc.; this is what the actual job is using) looks like this:
This, of course, is the case since `manual.sh -> scsi_hotplug.sh -> blk_removal.sh` is executed standalone, hence there's no proper core setup in place. However, with the extra setup we get an actual core:

With asan|ubsan we also get:
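(For context, the "extra setup" mentioned above is essentially just the usual coredump knobs - a rough sketch only, the actual job wires this up differently:)

```
# Allow cores and point core_pattern at a writable location (example path)
ulimit -c unlimited
mkdir -p /tmp/cores
echo '/tmp/cores/%e-%p.core' > /proc/sys/kernel/core_pattern
```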
Just a note about the state of the VM after the crash happens. When the `HotInNvme0` is removed, the corresponding block device (attached to virtio-pci) is still present, but any type of IO operation immediately fails. This part may be expected.

However, when `vhost` suddenly disappears due to a crash, this virtio device is still present inside the VM and the entire userspace gets feverish - especially during reboot, `udev-worker` threads get stuck in the kernel almost immediately when `systemd-udevd` kicks in (I can attach traces from the kernel if anyone is interested in where exactly they end up). Any other tooling which is executed from the initramfs also blocks indefinitely - e.g. `lvm`. All this eventually leads to a boot timeout triggered from `vm_wait_for_boot()` (since the corresponding services are completely blocked).
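If it helps, the quickest way to see which of those tasks are wedged (besides the sysrq-w dump shown earlier) is to list everything sitting in uninterruptible sleep together with its wait channel, e.g.:

```
# Inside the guest: tasks in D state plus the kernel function they are blocked in
ps -eo pid,stat,wchan:40,comm --no-headers | awk '$2 ~ /D/'
```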
This begs the question about the overall interaction between `vhost` and the VM, since this behavior is extremely similar to issues like #3322, #3344, #3264 and #3392 - I am not saying it's exactly the same, but it's clear that this hotplug suite is uncovering some more serious problem.

That said, based on the traces, I have been looking at 1c05f3fb0a, which was merged quite recently (a month ago, roughly when we started seeing the increase in failure rate) - after reverting this change, the issue disappears (or at least the frequency drops to a level where it seems like the issue is gone).
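For reference, a rough way to compare the failure rate with and without the revert (the iteration count and the way the suite is invoked below are just placeholders, the CI job drives this differently):

```
# Revert the suspect commit and re-run the suite in a loop, counting failures
git revert --no-edit 1c05f3fb0a
fails=0
for i in $(seq 1 50); do
    ./manual.sh || fails=$((fails + 1))   # placeholder invocation of the suite
done
echo "failures: ${fails}/50"
```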