Closed tmakatos closed 4 years ago
Because of an unrelated bug, the process providing device emulation died (specifically it's muser SPDK, https://github.com/tmakatos/spdk):
[ 5.793367] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10 qemu-system-x86_64: vfio_region_write(00000000-0000-0000-0000-000000000000:region0+0x14, 0x0,4) failed: Operation not permitted qemu-system-x86_64: vfio_region_read(00000000-0000-0000-0000-000000000000:region0+0x1c, 4) failed: Timer expired [ 10.837434] nvme nvme0: Removing after probe failure status: -19 qemu-system-x86_64: vfio_region_read(00000000-0000-0000-0000-000000000000:region0+0x1c, 4) failed: Timer expired qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x4a, 0x3, 0x2) failed: Operation not permitted qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x3d, 0x1) failed: Timer expired qemu-system-x86_64: vfio 00000000-0000-0000-0000-000000000000: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Operation not permitted qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x4, 0x2) failed: Timer expired qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x4, 0xfbc2, 0x2) failed: Operation not permitted qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x5c, 0x2) failed: Timer expired qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x5c, 0xffc0, 0x2) failed: Operation not permitted qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x4, 0x2) failed: Timer expired qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x0, 0x4) failed: Timer expired
The host kernel:
[1476060.673557] reactor_0[217443]: segfault at fc ip 0000558c28d78f64 sp 00007fcec1dfaa10 error 4 in nvmf_tgt[558c28cba000+1df000] [1476060.675394] Code: 00 00 00 00 e8 58 8b 00 00 48 8b 45 f8 0f b6 00 83 e0 01 84 c0 0f 85 ae 01 00 00 48 8b 45 e8 48 8b 80 78 07 00 00 48 8b 40 20 <0f> b6 80 fc 00 00 00 83 e0 01 84 c0 0f 84 8d 01 00 00 48 8d 05 c3 [1476060.709588] muser muser: libmuser_release: moving command back in list [1476065.701220] muser muser: muser_process_cmd: giving up, no response for cmd 3 [1476070.821272] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476075.941249] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476081.061339] muser muser: muser_process_cmd: giving up, no response for cmd 3 [1476081.063049] muser muser: muser_ioctl_setup_cmd: ignore DATA_NONE index=2 start=0 count=0 [1476086.181391] muser muser: muser_process_cmd: giving up, no response for cmd 1 [1476091.301391] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476096.421497] muser muser: muser_process_cmd: giving up, no response for cmd 1 [1476101.541559] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476106.661604] muser muser: muser_process_cmd: giving up, no response for cmd 3 [1476111.781594] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476116.901669] muser muser: muser_process_cmd: giving up, no response for cmd 3 [1476122.021798] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476127.653830] muser muser: muser_process_cmd: giving up, no response for cmd 2 [1476258.950949] muser muser: MDEV: Unregistering [1476268.622393] BUG: unable to handle page fault for address: 0000000234ff6038 [1476268.623160] #PF: supervisor read access in kernel mode [1476268.623692] #PF: error_code(0x0000) - not-present page [1476268.624243] PGD 0 P4D 0 [1476268.624763] Oops: 0000 [#1] SMP PTI [1476268.625273] CPU: 5 PID: 217452 Comm: qemu-system-x86 Tainted: G W O 5.4.0-rc8+ #11 [1476268.625787] Hardware name: Nutanix AHV, BIOS 1.9.1-5.el6 04/01/2014 [1476268.626319] RIP: 0010:__mutex_lock.isra.7+0xd0/0x4d0 [1476268.626895] Code: 39 c2 74 b7 48 89 c2 eb 9c 65 48 8b 04 25 c0 6b 01 00 48 8b 00 a8 08 0f 85 8f 00 00 00 49 8b 07 48 83 e0 f8 0f 84 77 03 00 00 <8b> 50 38 85 d2 0f 85 06 01 00 00 85 d2 74 73 4d 8d 67 0c 4c 89 e7 [1476268.628101] RSP: 0018:ffffb1fc00ab3b80 EFLAGS: 00010206 [1476268.628693] RAX: 0000000234ff6000 RBX: ffff902a2e5a6100 RCX: 0000000234ff6000 [1476268.629309] RDX: 0000000234ff6000 RSI: ffff902a2f3ff080 RDI: ffff902a2e5a6150 [1476268.629983] RBP: ffffb1fc00ab3c20 R08: 0000000000000000 R09: 0000000000000000 [1476268.630617] R10: ffffb1fc00ab3c40 R11: ffff902a2b2a0d10 R12: ffffb1fc00ab3c40 [1476268.631228] R13: ffff902a2cfc1900 R14: 0000000000000002 R15: ffff902a2e5a6150 [1476268.631883] FS: 0000000000000000(0000) GS:ffff902a34140000(0000) knlGS:0000000000000000 [1476268.632548] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1476268.633209] CR2: 0000000234ff6038 CR3: 00000000a3c0a004 CR4: 00000000001606e0 [1476268.633915] Call Trace: [4/1804] [1476268.634628] ? locks_remove_posix+0xc8/0x160 [1476268.635327] ? dma_unmap_all+0x45/0x150 [muser] [1476268.636009] dma_unmap_all+0x45/0x150 [muser] [1476268.636734] muser_close+0x1a/0x90 [muser] [1476268.637413] vfio_mdev_release+0x1e/0x30 [vfio_mdev] [1476268.638123] vfio_device_fops_release+0x1e/0x40 [vfio] [1476268.638813] __fput+0xbe/0x250 [1476268.639551] task_work_run+0x8a/0xb0 [1476268.640219] do_exit+0x2e0/0xbb0 [1476268.640846] do_group_exit+0x3a/0xa0 [1476268.641444] get_signal+0x16d/0x8e0 [1476268.642051] ? __wake_up_common+0x96/0x180 [1476268.642607] do_signal+0x30/0x690 [1476268.643191] ? eventfd_write+0xbe/0x2a0 [1476268.643779] ? __x64_sys_futex+0x87/0x170 [1476268.644334] exit_to_usermode_loop+0x91/0xf0 [1476268.644933] do_syscall_64+0x159/0x180 [1476268.645497] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [1476268.646081] RIP: 0033:0x7fc799ec317f [1476268.646633] Code: Bad RIP value. [1476268.647251] RSP: 002b:00007fc73d3e3510 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca [1476268.647872] RAX: fffffffffffffe00 RBX: 000055a5aa5ab1f0 RCX: 00007fc799ec317f [1476268.648531] RDX: 00000000000014e7 RSI: 0000000000000080 RDI: 000055a5aa5ab1f4 [1476268.649146] RBP: 000055a5a90e97e0 R08: 000055a5a90e9700 R09: 0000000000000a73 [1476268.649789] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a5a885b774 [1476268.650422] R13: 00000000000004dd R14: 00007fc73cbe6000 R15: 0000000000000003 [1476268.651078] Modules linked in: muser(O-) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag vfio_mdev mdev vfio_iommu_type1 vfio binfmt_misc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bochs_drm drm_vram_helper snd_pcm ttm aesni_intel snd_timer crypto_simd drm_kms_helper snd cryptd glue_helper soundcore evdev joydev sg drm serio_raw pcspkr virtio_balloon button ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic virtio_net net_failover virtio_scsi failover crc32c_intel ata_piix uhci_hcd ehci_pci ehci_hcd libata psmouse virtio_pci v$ rtio_ring i2c_piix4 usbcore virtio scsi_mod floppy [last unloaded: muser] [1476268.655755] CR2: 0000000234ff6038 [1476268.656534] ---[ end trace 7cf53999121ec065 ]--- [1476268.657271] RIP: 0010:__mutex_lock.isra.7+0xd0/0x4d0 [1476268.658503] Code: 39 c2 74 b7 48 89 c2 eb 9c 65 48 8b 04 25 c0 6b 01 00 48 8b 00 a8 08 0f 85 8f 00 00 00 49 8b 07 48 83 e0 f8 0f 84 77 03 00 00 <8b> 50 38 85 d2 0f 85 06 01 00 00 85 d2 74 73 4d 8d 67 0c 4c 89 e7 [1476268.661380] RSP: 0018:ffffb1fc00ab3b80 EFLAGS: 00010206 [1476268.662773] RAX: 0000000234ff6000 RBX: ffff902a2e5a6100 RCX: 0000000234ff6000 [1476268.664248] RDX: 0000000234ff6000 RSI: ffff902a2f3ff080 RDI: ffff902a2e5a6150 [1476268.665585] RBP: ffffb1fc00ab3c20 R08: 0000000000000000 R09: 0000000000000000 [1476268.667002] R10: ffffb1fc00ab3c40 R11: ffff902a2b2a0d10 R12: ffffb1fc00ab3c40 [1476268.668475] R13: ffff902a2cfc1900 R14: 0000000000000002 R15: ffff902a2e5a6150 [1476268.669943] FS: 0000000000000000(0000) GS:ffff902a34140000(0000) knlGS:0000000000000000 [1476268.671391] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1476268.672700] CR2: 00007fc799ec3155 CR3: 00000000a3c0a004 CR4: 00000000001606e0 [1476268.674077] Fixing recursive fault but reboot is needed! [1476284.630049] vfio_mdev 00000000-0000-0000-0000-000000000000: Device is currently in use, task "rmmod" (223678) blocked until device is released
I then ran rmmod trying to clean up, which got stuck (included in previous stack trace). The device emulation application cannot be killed. This requires a hard reset to rectify.
rmmod
FYI @swapnili
Tested and confirmed that #35 fixes it.
Because of an unrelated bug, the process providing device emulation died (specifically it's muser SPDK, https://github.com/tmakatos/spdk):
The host kernel:
I then ran
rmmod
trying to clean up, which got stuck (included in previous stack trace). The device emulation application cannot be killed. This requires a hard reset to rectify.FYI @swapnili