gnif / vendor-reset

Linux kernel vendor specific hardware reset module for sequences that are too complex/complicated to land in pci_quirks.c
GNU General Public License v2.0

Soft restart of guest OS leaves navi 5700 in bad state #49

Closed: jd3nn1s closed this issue 1 year ago

jd3nn1s commented 2 years ago

I am passing both video and audio devices through to the guest OS. Doing a soft restart of the guest leads to IO timeouts in the host OS.

[  144.319744] vfio-pci 0000:11:00.0: enabling device (0002 -> 0003)
[  144.319917] vfio-pci 0000:11:00.0: AMD_NAVI10: version 1.1
[  144.319918] vfio-pci 0000:11:00.0: AMD_NAVI10: performing pre-reset
[  144.339800] vfio-pci 0000:11:00.0: AMD_NAVI10: performing reset
[  144.444968] ATOM BIOS: 113-D1820201-101
[  144.444970] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  144.703142] vfio-pci 0000:11:00.0: AMD_NAVI10: bus reset disabled? yes
[  144.703147] vfio-pci 0000:11:00.0: AMD_NAVI10: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[  144.703149] vfio-pci 0000:11:00.0: AMD_NAVI10: performing post-reset
[  144.739763] vfio-pci 0000:11:00.0: AMD_NAVI10: reset result = 0
[  144.739926] vfio-pci 0000:11:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  144.739938] vfio-pci 0000:11:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  144.739942] vfio-pci 0000:11:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  144.739944] vfio-pci 0000:11:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  144.739946] vfio-pci 0000:11:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  144.759555] vfio-pci 0000:11:00.1: enabling device (0000 -> 0002)
[  145.800049] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x200
[  146.049592] usb 5-6.3: USB disconnect, device number 5
[  148.155701] vfio-pci 0000:11:00.0: AMD_NAVI10: version 1.1
[  148.155703] vfio-pci 0000:11:00.0: AMD_NAVI10: performing pre-reset
[  148.155860] vfio-pci 0000:11:00.0: AMD_NAVI10: performing reset
[  148.261063] ATOM BIOS: 113-D1820201-101
[  148.261065] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  148.522931] vfio-pci 0000:11:00.0: AMD_NAVI10: bus reset disabled? yes
[  148.522936] vfio-pci 0000:11:00.0: AMD_NAVI10: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[  148.522939] vfio-pci 0000:11:00.0: AMD_NAVI10: performing post-reset
[  148.559488] vfio-pci 0000:11:00.0: AMD_NAVI10: reset result = 0
[  180.367104] usb 5-6: reset full-speed USB device number 2 using xhci_hcd
[  488.013034] usb 5-6: reset full-speed USB device number 2 using xhci_hcd
[  490.288791] vfio-pci 0000:11:00.0: AMD_NAVI10: version 1.1
[  490.288792] vfio-pci 0000:11:00.0: AMD_NAVI10: performing pre-reset
[  490.288941] vfio-pci 0000:11:00.0: AMD_NAVI10: performing reset
[  490.394872] ATOM BIOS: 113-D1820201-101
[  490.394874] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  490.394877] vfio-pci 0000:11:00.0: AMD_NAVI10: bus reset disabled? yes
[  490.394882] vfio-pci 0000:11:00.0: AMD_NAVI10: SMU response reg: 1, sol reg: cc8d3c5, mp1 intr enabled? yes, bl ready? yes
[  490.394883] vfio-pci 0000:11:00.0: AMD_NAVI10: Clearing scratch regs 6 and 7
[  490.395011] vfio-pci 0000:11:00.0: AMD_NAVI10: begin psp mode 1 reset
[  490.924367] vfio-pci 0000:11:00.0: AMD_NAVI10: mode1 reset succeeded
[  492.868373] vfio-pci 0000:11:00.0: AMD_NAVI10: PSP mode1 reset successful
[  492.868377] vfio-pci 0000:11:00.0: AMD_NAVI10: performing post-reset
[  492.908377] vfio-pci 0000:11:00.0: AMD_NAVI10: reset result = 0
[  493.313834] AMD-Vi: Completion-Wait loop timed out
[  493.697898] AMD-Vi: Completion-Wait loop timed out
[  493.820439] AMD-Vi: Completion-Wait loop timed out
[  493.989660] AMD-Vi: Completion-Wait loop timed out
[  494.112042] AMD-Vi: Completion-Wait loop timed out
[  494.193546] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c17e0]
[  494.316113] AMD-Vi: Completion-Wait loop timed out
[  494.465653] AMD-Vi: Completion-Wait loop timed out
[  494.588560] AMD-Vi: Completion-Wait loop timed out
[  494.711380] AMD-Vi: Completion-Wait loop timed out
[  494.833749] AMD-Vi: Completion-Wait loop timed out
[  495.195690] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1810]
[  496.197585] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1840]
[  497.199414] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1870]
[  498.201329] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c18a0]
[  499.202942] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c18d0]
[  500.204820] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1900]
[  501.206697] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1930]
[  502.033036] AMD-Vi: Completion-Wait loop timed out
[  502.160830] AMD-Vi: Completion-Wait loop timed out
[  502.208669] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1960]
[  502.347346] AMD-Vi: Completion-Wait loop timed out
[  502.470313] AMD-Vi: Completion-Wait loop timed out
[  502.593367] AMD-Vi: Completion-Wait loop timed out
[  502.716691] AMD-Vi: Completion-Wait loop timed out
[  502.840174] AMD-Vi: Completion-Wait loop timed out
[  502.921063] do_IRQ: 9.34 No irq handler for vector
[  502.962893] AMD-Vi: Completion-Wait loop timed out
[  502.984596] do_IRQ: 18.34 No irq handler for vector
[  502.984786] do_IRQ: 9.34 No irq handler for vector
[  502.984791] do_IRQ: 11.34 No irq handler for vector
[  502.984802] do_IRQ: 28.36 No irq handler for vector
[  503.085883] AMD-Vi: Completion-Wait loop timed out
[  503.209101] AMD-Vi: Completion-Wait loop timed out
[  503.210453] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1990]
[  503.333541] AMD-Vi: Completion-Wait loop timed out
[  503.456831] AMD-Vi: Completion-Wait loop timed out
[  503.580268] AMD-Vi: Completion-Wait loop timed out
[  503.702970] AMD-Vi: Completion-Wait loop timed out
[  503.825977] AMD-Vi: Completion-Wait loop timed out
[  503.949168] AMD-Vi: Completion-Wait loop timed out
[  504.072509] AMD-Vi: Completion-Wait loop timed out
[  504.196137] AMD-Vi: Completion-Wait loop timed out
[  504.212324] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c19c0]
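
A minimal sketch for summarizing a log like the one above: it counts `IOTLB_INV_TIMEOUT` events per device and reports how soon after the most recent `reset result` line the first timeout appears. The function name `summarize`, the `SAMPLE` excerpt, and the regular expressions are illustrative assumptions based only on the line formats shown here; they are not part of vendor-reset.

```python
import re
from collections import Counter

# Matches a kernel log line such as "[  494.193546] message text".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Matches the device address in an AMD-Vi IOTLB_INV_TIMEOUT event.
TIMEOUT_RE = re.compile(r"IOTLB_INV_TIMEOUT device=([0-9a-f:.]+)")


def summarize(log_text):
    """Return (timeouts per device, last reset time, first timeout after that reset)."""
    timeouts = Counter()
    last_reset = None
    first_timeout = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines that are not timestamped kernel messages
        ts, msg = float(match.group(1)), match.group(2)
        if "reset result =" in msg:
            # A new reset completed; restart the search for the next timeout.
            last_reset = ts
            first_timeout = None
        hit = TIMEOUT_RE.search(msg)
        if hit:
            timeouts[hit.group(1)] += 1
            if last_reset is not None and first_timeout is None:
                first_timeout = ts
    return timeouts, last_reset, first_timeout


# Hypothetical excerpt copied from the log above, used only for illustration.
SAMPLE = """\
[  492.908377] vfio-pci 0000:11:00.0: AMD_NAVI10: reset result = 0
[  493.313834] AMD-Vi: Completion-Wait loop timed out
[  494.193546] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c17e0]
[  495.195690] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=11:00.0 address=0xff85c1810]
"""

if __name__ == "__main__":
    counts, reset_ts, timeout_ts = summarize(SAMPLE)
    for device, count in counts.items():
        print(f"{device}: {count} IOTLB invalidation timeouts")
    if reset_ts is not None and timeout_ts is not None:
        print(f"first timeout {timeout_ts - reset_ts:.3f}s after last reset")
```

Running the same function over the full log, saved to a file, would show whether timeouts only follow the reset that took the PSP mode1 path or also follow the earlier resets.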

pluser commented 2 years ago

What is your kernel version? If you are using 5.15+, you may be affected by this problem.

jd3nn1s commented 2 years ago

Kernel is version 5.4.0-91-generic #102-Ubuntu

jd3nn1s commented 2 years ago

Are there any steps I could take to debug this issue?

pluser commented 2 years ago

I'm also using a Radeon 5700 XT (Navi 10). In my environment, soft reset works fine, so this problem might be specific to your environment.

Did you try updating the driver in the guest OS?

jd3nn1s commented 1 year ago

I switched to a 6xxx