What's confusing to me is this:

[Thu Jun 20 14:13:53 2024] dpdk-vhost-evt[742863]: segfault at b9 ip 00000000004d0c12 sp 00007f2b31bfe708 error 4 in vhost[407000+28c000] likely on CPU 6 (core 6, socket 0)

It says that the `dpdk-vhost-evt` thread got hit while spinning on CPU 6. This is weird, since the `vhost` app is told to execute under CPU 0 only. Is this somehow ignored? Or is the kernel simply mistaken? I haven't tracked these threads to see what CPU affinity they are actually running with, but still, it's quite surprising.
DPDK processes vhost events on a separate pthread - so this is expected for that `dpdk-vhost-evt` thread.
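For what it's worth, a quick way to double-check what each thread is actually bound to (assuming the target really runs as a binary named `vhost`, as the segfault line suggests):

```
# Show the CPU each thread of the vhost process last ran on (PSR column)
ps -T -o tid,comm,psr -p "$(pidof vhost)"

# Show the affinity mask each thread is actually allowed to run with
for tid in $(ps -T -o tid= -p "$(pidof vhost)"); do
    taskset -cp "$tid"
done
```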
Some additional info regarding the state of the VM:

When `vhost` crashes, the `vhost-user-blk-pci` part on the QEMU side does attempt to close all the chardevs that were previously connected. However, even though they are reported as "disconnected", they are still busy and cannot be removed:
char_Nvme0n1p0: filename=disconnected:unix:/root/vhost_test/vhost/0/naa.Nvme0n1p0.0
char_Nvme0n1p1: filename=disconnected:unix:/root/vhost_test/vhost/0/naa.Nvme0n1p1.0
seabios: filename=file
compat_monitor0: filename=telnet:127.0.0.1:10002,server=on <-> 127.0.0.1:38708
serial0: filename=pty:/dev/pts/2
(qemu) char
chardev-add chardev-change chardev-remove chardev-send-break
(qemu) chardev-remove
char_Nvme0n1p0 char_Nvme0n1p1 compat_monitor0 parallel0
seabios serial0
(qemu) chardev-remove ch
char_Nvme0n1p0 char_Nvme0n1p1
(qemu) chardev-remove char_Nvme0n1p0
Error: Chardev 'char_Nvme0n1p0' is busy
(qemu) chardev-remove char_Nvme0n1p1
Error: Chardev 'char_Nvme0n1p1' is busy
(qemu)
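For completeness, the same can be seen/attempted over QMP as well. A minimal sketch - `qmp-shell` ships with the QEMU sources, and the socket path below is just an example of a `-qmp` endpoint, not what this setup actually uses:

```
$ qmp-shell /tmp/qmp.sock
(QEMU) query-chardev
(QEMU) chardev-remove id=char_Nvme0n1p0
```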
This, I guess, is sort of expected, as at this point the kernel inside the guest VM still sees the related block devices. However, an attempt to perform any sort of operation against these devices leaves the given process stuck inside the kernel - as mentioned in the initial report, the system inside the VM is essentially dead. There's a chance to somewhat operate inside the VM when the boot sequence is kept very minimal (e.g. by dropping the initrd). In such a case, it's possible to observe where the kernel is getting stuck. E.g.:
[root@vhostfedora-cloud-23052 ~]# echo w >/proc/sysrq-trigger
[root@vhostfedora-cloud-23052 ~]# for p in \
> $(dmesg | sed -n 's/.*[^p]pid:\([0-9]\+\).*/\1/p'); do
> echo "$p:"; cat "/proc/$p/stack"; done
254:
[<0>] folio_wait_bit_common+0x13d/0x350
[<0>] do_read_cache_folio+0x12a/0x190
[<0>] read_part_sector+0x36/0xb0
[<0>] sgi_partition+0x3f/0x350
[<0>] bdev_disk_changed+0x2aa/0x700
[<0>] blkdev_get_whole+0x7a/0x90
[<0>] blkdev_get_by_dev.part.0+0x174/0x320
[<0>] disk_scan_partitions+0x69/0xe0
[<0>] device_add_disk+0x3bb/0x3c0
[<0>] virtblk_probe+0x8a2/0xe40 [virtio_blk]
[<0>] virtio_dev_probe+0x1b0/0x270
[<0>] really_probe+0x19b/0x3e0
[<0>] __driver_probe_device+0x78/0x160
[<0>] driver_probe_device+0x1f/0x90
[<0>] __driver_attach+0xd2/0x1c0
[<0>] bus_for_each_dev+0x85/0xd0
[<0>] bus_add_driver+0x116/0x220
[<0>] driver_register+0x59/0x100
[<0>] virtio_blk_init+0x4e/0xff0 [virtio_blk]
[<0>] do_one_initcall+0x5a/0x320
[<0>] do_init_module+0x60/0x240
[<0>] __do_sys_init_module+0x17f/0x1b0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
497:
[<0>] blkdev_get_by_dev.part.0+0x134/0x320
[<0>] blkdev_open+0x47/0xb0
[<0>] do_dentry_open+0x200/0x500
[<0>] path_openat+0xafe/0x1160
[<0>] do_filp_open+0xb3/0x160
[<0>] do_sys_openat2+0xab/0xe0
[<0>] __x64_sys_openat+0x57/0xa0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
501:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
502:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@vhostfedora-cloud-23052 ~]#
[root@vhostfedora-cloud-23052 ~]# jobs
[1] Running ( read -rn1 < /dev/vda ) &
[2]- Running reboot -f &
[3]+ Running reboot -f &
[root@vhostfedora-cloud-23052 ~]# for p in $(jobs -p); do
> echo "$p:"; cat "/proc/$p/stack"; done
497:
[<0>] blkdev_get_by_dev.part.0+0x134/0x320
[<0>] blkdev_open+0x47/0xb0
[<0>] do_dentry_open+0x200/0x500
[<0>] path_openat+0xafe/0x1160
[<0>] do_filp_open+0xb3/0x160
[<0>] do_sys_openat2+0xab/0xe0
[<0>] __x64_sys_openat+0x57/0xa0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
501:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
502:
[<0>] sync_bdevs+0x92/0x160
[<0>] ksys_sync+0x6b/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@vhostfedora-cloud-23052 ~]#
Even an attempt to reboot or shut down the kernel through sysrq blocks forever.
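(That is, the standard magic-sysrq triggers along the lines of:)

```
# 'b' = immediate reboot, 'o' = power off; in this state neither gets the VM down
echo b > /proc/sysrq-trigger
echo o > /proc/sysrq-trigger
```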
I am really not sure what's at fault here. Granted, `vhost` shouldn't crash at all, but the very same thing happens when `vhost` is simply terminated while the VM is still running. Is it QEMU's fault? Is the kernel inside the VM not kicked with some crucial updates regarding the state of the devices? Is it the kernel's fault for not handling something properly down the stack? Is it both? Testing with the latest QEMU and a newer kernel (6.9.4) yields exactly the same results when `vhost` suddenly disappears.
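In other words, the crash itself isn't even needed to get the guest into this state; doing something like the following on the host while the VM is up ends the same way (assuming the target runs as a plain `vhost` binary, as in the segfault line above):

```
# Abruptly terminate the vhost target while the VM is still running;
# the guest ends up in the same wedged state as after the segfault
kill -9 "$(pidof vhost)"
```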
Under CI, we have a separate timer-based job which executes the above test suite. Recently, we have been seeing a high rate of failures in the blk (virtio-pci) variant. In this particular case, it's caused by a `vhost` crash which usually happens at `blk_hotremove_tc2()`, when the VM is being rebooted right after the `HotInNvme0` ctrl gets removed. Essentially, the suite boils down to:
Crash under a plain SPDK build (no debug, no asan|ubsan, etc.; this is what the actual job is using) looks like this:
This, of course, is the case since `manual.sh -> scsi_hotplug.sh -> blk_removal.sh` is executed standalone, hence there's no proper core setup in place. However, with the extra setup we get an actual core:

With asan|ubsan we also get:
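(For context, the "extra setup" mentioned above is essentially just the usual coredump knobs - a rough sketch only, the actual job wires this up differently:)

```
# Allow cores and point core_pattern at a writable location (example path)
ulimit -c unlimited
mkdir -p /tmp/cores
echo '/tmp/cores/%e-%p.core' > /proc/sys/kernel/core_pattern
```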
Just a note about the state of the VM after the crash happens. When the `HotInNvme0` is removed, the corresponding block device (attached to virtio-pci) is still present, but any type of IO operation immediately fails. This part may be expected.

However, when `vhost` suddenly disappears due to a crash, this virtio device is still present inside the VM and the entire userspace gets feverish - especially during reboot, `udev-worker` threads get stuck in the kernel almost immediately when `systemd-udevd` kicks in (I can attach traces from the kernel if anyone is interested in where exactly they end up). Any other tooling which is executed from the initramfs also blocks indefinitely - e.g. `lvm`. All this eventually leads to a boot timeout triggered from `vm_wait_for_boot()` (since the corresponding services are completely blocked).
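If it helps, the quickest way to see which of those tasks are wedged (besides the sysrq-w dump shown earlier) is to list everything sitting in uninterruptible sleep together with its wait channel, e.g.:

```
# Inside the guest: tasks in D state plus the kernel function they are blocked in
ps -eo pid,stat,wchan:40,comm --no-headers | awk '$2 ~ /D/'
```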
This begs the question about the overall interaction between `vhost` and the VM, since this behavior is extremely similar to issues like #3322, #3344, #3264 and #3392 - I am not saying it's exactly the same, but it's clear that this hotplug suite is uncovering some more serious problem.

That said, based on the traces, I have been looking at 1c05f3fb0a, which was merged quite recently (a month ago, roughly when we started seeing the increase in failure rate) - after reverting this change, the issue disappears (or at least the frequency drops to a level where it seems like the issue is gone).
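For reference, a rough way to compare the failure rate with and without the revert (the iteration count and the way the suite is invoked below are just placeholders, the CI job drives this differently):

```
# Revert the suspect commit and re-run the suite in a loop, counting failures
git revert --no-edit 1c05f3fb0a
fails=0
for i in $(seq 1 50); do
    ./manual.sh || fails=$((fails + 1))   # placeholder invocation of the suite
done
echo "failures: ${fails}/50"
```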