spdk / spdk

Storage Performance Development Kit
https://spdk.io/

vfio-user: Memory region register / unregister failed when restarting Qemu VM #1922

Closed karlatec closed 3 years ago

karlatec commented 3 years ago

Current Behavior

After restarting a QEMU VM that is attached to the vfio-user socket, a series of errors is displayed:

[2021-04-28 12:44:59.578231] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3eae00000-0x7fe3eb000000 failed
[2021-04-28 12:44:59.578393] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.595599] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.595946] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aac00000-0x7fe3eac00000 failed
[2021-04-28 12:44:59.597287] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aac00000-0x7fe3eac00000 failed
[2021-04-28 12:44:59.597385] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.618455] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.618537] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3ab000000-0x7fe3eb000000 failed
[2021-04-28 12:44:59.619627] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3ab000000-0x7fe3eb000000 failed
[2021-04-28 12:44:59.619861] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.677225] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
[2021-04-28 12:44:59.677397] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3eaa00000-0x7fe3eac00000 failed
[2021-04-28 12:44:59.677704] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aa400000-0x7fe3ea400000 failed
[2021-04-28 12:44:59.678955] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aa200000-0x7fe3aa400000 failed
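
The failing ranges are easier to compare once they are parsed. The following is a minimal sketch, not part of SPDK, that extracts the operation and address range from log lines in the format shown above and reports each region's size. The regular expression is derived only from these log lines, and treating each pair as a half-open range `[start, end)` is an assumption.

```python
import re

# Pattern derived from the log lines in this report. Treating the pair as a
# half-open range [start, end) is an assumption, not something SPDK documents here.
LINE_RE = re.compile(
    r"Memory region (?P<op>register|unregister) "
    r"0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+) failed"
)

MIB = 1 << 20


def parse_failures(log_text):
    """Return a list of (op, start, end) tuples for every failed operation."""
    return [
        (m["op"], int(m["start"], 16), int(m["end"], 16))
        for m in LINE_RE.finditer(log_text)
    ]


def size_mib(start, end):
    """Size of the range in MiB, assuming `end` is exclusive."""
    return (end - start) // MIB


# Two lines copied verbatim from the log above.
SAMPLE_LOG = """\
[2021-04-28 12:44:59.578231] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3eae00000-0x7fe3eb000000 failed
[2021-04-28 12:44:59.595599] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
"""

if __name__ == "__main__":
    for op, start, end in parse_failures(SAMPLE_LOG):
        print(f"{op:<10} {start:#x}-{end:#x} {size_mib(start, end)} MiB")
```

Under that assumption, the first sample line describes a 2 MiB region and the second a 1024 MiB region, which matches the 1024M memory backend passed to QEMU in the reproduction steps.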

Steps to Reproduce

  1. Build SPDK
    ./configure --with-fio=/usr/src/fio --with-vfio-user --with-rdma --enable-debug
  2. Run SPDK target and configure to use vfio-user transport:
    
    #!/usr/bin/env bash

    i=1

    rm -rf /var/run/muser
    rm -rf /dev/shm/muser

    mkdir -p /var/run/muser
    mkdir -p /var/run/muser/iommu_group
    mkdir -p /var/run/muser/domain/muser$i/$i
    mkdir -p /dev/shm/muser/muser$i
    sleep 1

    build/bin/nvmf_tgt -m [12] &
    pid=$!
    echo "PID: $pid"
    sleep 3

    scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t pcie -a 0000:0b:00.0
    scripts/rpc.py nvmf_create_transport --trtype VFIOUSER
    scripts/rpc.py nvmf_create_subsystem nqn.2019-07.io.spdk:cnode0 -s SPDK001 -a
    scripts/rpc.py nvmf_subsystem_add_ns nqn.2019-07.io.spdk:cnode0 Nvme0n1
    scripts/rpc.py nvmf_subsystem_add_listener nqn.2019-07.io.spdk:cnode0 -t VFIOUSER -a /var/run/muser/domain/muser$i/$i -s 0
    sleep 1

    ln -s /var/run/muser/domain/muser$i/$i /var/run/muser/domain/muser$i/$i/iommu_group
    ln -s /var/run/muser/domain/muser$i/$i /var/run/muser/iommu_group/$i
    ln -s /var/run/muser/domain/muser$i/$i/bar0 /dev/shm/muser/muser$i/bar0

3. Run the VM:

    taskset -a -c 1-2 /home/klateck/work/qemu-vfiouser/build/qemu-system-x86_64 -m 1024 --enable-kvm \
        -cpu host -smp 2 -vga std -vnc :100 -daemonize \
        -object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on,prealloc=yes,host-nodes=0,policy=bind \
        -snapshot -monitor telnet:127.0.0.1:10002,server,nowait \
        -numa node,memdev=mem \
        -pidfile /home/klateck/vhost_test/vms/0/qemu.pid \
        -serial file:/home/klateck/vhost_test/vms/0/serial.log \
        -D /home/klateck/vhost_test/vms/0/qemu.log \
        -chardev file,path=/home/klateck/vhost_test/vms/0/seabios.log,id=seabios \
        -device isa-debugcon,iobase=0x402,chardev=seabios \
        -net user,hostfwd=tcp::10000-:22,hostfwd=tcp::10001-:8765 \
        -net nic -drive file=/home/sys_sgci/spdk_dependencies/spdk_test_image.qcow2,if=none,id=os_disk \
        -device ide-hd,drive=os_disk,bootindex=0 \
        -device vfio-user-pci,socket=/var/run/muser/domain/muser1/1/cntrl

4. Log in to the VM, check that the attached bdev is visible, and shut down the VM:

    [root@vhost32-cloud-12806 ~]# lsblk
    NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda       8:0    0     3G  0 disk
    └─sda1    8:1    0     3G  0 part /
    nvme0n1 259:1    0 372.6G  0 disk
    [root@vhost32-cloud-12806 ~]# sudo poweroff
    Connection to 127.0.0.1 closed by remote host.
    Connection to 127.0.0.1 closed.

5. Start the VM with the same command as in point 3. Error messages from SPDK are displayed:

    [2021-04-28 12:57:43.651374] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.668633] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.669019] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aac00000-0x7fe3eac00000 failed
    [2021-04-28 12:57:43.670388] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aac00000-0x7fe3eac00000 failed
    [2021-04-28 12:57:43.670464] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.691523] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.691602] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3ab000000-0x7fe3eb000000 failed
    [2021-04-28 12:57:43.692692] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3ab000000-0x7fe3eb000000 failed
    [2021-04-28 12:57:43.692948] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.750532] vfio_user.c:1177:memory_region_remove_cb: *ERROR*: Memory region unregister 0x7fe3aae00000-0x7fe3eae00000 failed
    [2021-04-28 12:57:43.750691] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3eaa00000-0x7fe3eac00000 failed
    [2021-04-28 12:57:43.751007] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aa400000-0x7fe3ea400000 failed
    [2021-04-28 12:57:43.752075] vfio_user.c:1104:memory_region_add_cb: *ERROR*: Memory region register 0x7fe3aa200000-0x7fe3aa400000 failed



## Context (Environment including OS version, SPDK version, etc.)

SPDK `9ed384dab test/nvme_pcie: cases for building PRP and SGL request`
vfio-user submodule `3acb974  vfu_realize_ctx(): fix default PCI config space region (#445)`
gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9)
Qemu `tmakatos/qemu.git` branch `vfio-user-v0.6` commit `89ff714f4b set argsz in device get info`

changpe1 commented 3 years ago

A soft reboot with a physical NVMe device attached will always cause some failures; there are some issues in QEMU/libvfio-user that we haven't figured out yet.

changpe1 commented 3 years ago

Here is the error I hit when using a physical NVMe device as the backend:

    qemu-system-x86_64: kvm_set_user_memory_region: KVM_SET_USER_MEMORY_REGION failed, slot=4, start=0xc0000, size=0xbff40000: File exists
    kvm_set_phys_mem: error registering slot: File exists
    Aborted (core dumped)
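
To read the numbers in that message: "File exists" is the standard error string for `EEXIST`, and the slot covers the guest-physical range starting at `start` and spanning `size` bytes. The sketch below only does that arithmetic. The overlap helper and the "stale" slot are hypothetical, added to illustrate the kind of conflict that could produce `EEXIST`; they are not taken from QEMU or KVM, and the assumption is that KVM rejects a new slot that overlaps an existing one.

```python
import errno
import os

# Values copied from the QEMU error message above.
SLOT_START = 0xC0000
SLOT_SIZE = 0xBFF40000
SLOT_END = SLOT_START + SLOT_SIZE  # exclusive end of the guest-physical range


def overlaps(a_start, a_size, b_start, b_size):
    """True if the half-open ranges [a_start, a_start + a_size) and
    [b_start, b_start + b_size) share at least one byte."""
    return a_start < b_start + b_size and b_start < a_start + a_size


# Hypothetical slot left behind by a previous boot (illustrative values only).
STALE_START, STALE_SIZE = 0x100000, 0x1000

if __name__ == "__main__":
    print("slot range:", hex(SLOT_START), "-", hex(SLOT_END))
    print("errno name:", errno.errorcode[errno.EEXIST])
    print("errno text:", os.strerror(errno.EEXIST))
    print("conflicts with stale slot:",
          overlaps(SLOT_START, SLOT_SIZE, STALE_START, STALE_SIZE))
```

The slot in the message ends exactly at `0xc0000000`, so any leftover slot below that address and above `0xc0000` would conflict with it under this model.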

changpe1 commented 3 years ago

Linking this issue to https://github.com/nutanix/libvfio-user/issues/439; we will also track it there.

changpe1 commented 3 years ago

Here is the patch to fix it: https://review.spdk.io/gerrit/c/spdk/spdk/+/7689. We need to add device reset support in SPDK and take care of the memory regions that are registered with SPDK.
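
As a rough illustration of why reset handling matters here, the toy model below tracks registered regions and shows how a restart without cleanup makes re-registration fail, while clearing the regions on reset lets it succeed. This is a sketch under assumptions, not the code from the patch: `RegionRegistry` and its methods are hypothetical names, and whether duplicate registration is the actual failure path in SPDK is not confirmed in this thread.

```python
class RegionRegistry:
    """Toy model of a target tracking guest memory regions (not SPDK code)."""

    def __init__(self):
        self._regions = set()

    def register(self, start, end):
        """Return False if the exact region is already registered."""
        if (start, end) in self._regions:
            return False
        self._regions.add((start, end))
        return True

    def unregister(self, start, end):
        """Return False if the region was never registered."""
        if (start, end) not in self._regions:
            return False
        self._regions.remove((start, end))
        return True

    def reset(self):
        """Drop every tracked region, as a device reset handler might."""
        self._regions.clear()


if __name__ == "__main__":
    region = (0x7FE3AAE00000, 0x7FE3EAE00000)  # range taken from the log above

    registry = RegionRegistry()
    print("first boot:", registry.register(*region))

    # VM restarts and maps the same range again without any cleanup.
    print("restart without reset:", registry.register(*region))

    # With reset handling, stale regions are dropped before the guest remaps.
    registry.reset()
    print("restart after reset:", registry.register(*region))
```

In this model the second registration fails only because stale state survives the restart, which is the situation a reset handler is meant to prevent.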

tmakatos commented 3 years ago

Unregistering the DMA regions might be something libvfio-user should do; I've started a discussion on Slack.

changpe1 commented 3 years ago

I updated the submodule via https://review.spdk.io/gerrit/c/spdk/spdk/+/7831. I tested it, and this issue has been fixed.

changpe1 commented 3 years ago

The patch has been merged. @karlatec, the issues that blocked the vfio-user performance tests have been fixed, so I think we can start the performance tests now.