gnif / vendor-reset

Linux kernel vendor specific hardware reset module for sequences that are too complex/complicated to land in pci_quirks.c
GNU General Public License v2.0

Binding and unbinding from amdgpu -> unstable Windows VM until reboot #52

Open drujd opened 2 years ago

drujd commented 2 years ago

I have a 640SP version of RX550 (Polaris11-based) and it seems that something is missing from its reset routine to work correctly in a guest Windows 11 VM after using it with Linux/amdgpu driver before that (regardless of whether that happens on a host or in a guest Linux VM).

If the GPU is never bound to amdgpu (vfio-pci.ids=1002:67ff,1002:aae0 kernel param), it works perfectly. I can reboot, reset, shutdown & start the VM again and all is fine (but I think that was the case even without this module).
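To confirm which driver currently owns the card, the `driver` symlink under the device's sysfs directory can be read. A minimal Python sketch, assuming the standard `/sys/bus/pci/devices` layout; `bound_driver` and its `sysfs_root` parameter are hypothetical names used only for illustration:

```python
import os

def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to pci_addr, or None.

    sysfs_root is a parameter (an assumption of this sketch) so the
    function can be pointed at a fake tree for testing.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        # No driver bound, or the device does not exist.
        return None
    # The symlink points at the driver's directory; its last path
    # component is the driver name (e.g. vfio-pci or amdgpu).
    return os.path.basename(os.readlink(link))
```

Calling `bound_driver("0000:05:00.0")` before starting the VM should report `vfio-pci` if the card was never handed to amdgpu.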

However, once I actually use the GPU in Linux (whether in a host or guest system doesn't matter), it is 'doomed' for Windows usage until a (host) reboot. The VM seems to work at first and boots to Windows, but after a while on the desktop (or immediately if I e.g. try to start Edge), the driver (21.12.1) crashes, the screen blinks many times and eventually Windows falls back to the basic driver. A reboot / hard reset / shutdown of the VM doesn't help; only a reboot of the whole system does.

I am running Arch 5.15.12-arch1. I am aware of #46 and have 'w /sys/bus/pci/devices/0000:05:00.0/reset_method - - - - device_specific' in tmpfiles.d and the module seems to work 'correctly':

systemd[1]: Started Virtual Machine qemu-1-win11.
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: version 1.1
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing pre-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x20594
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: Performing BACO reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing post-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: reset result = 0
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
kernel: vfio-pci 0000:05:00.1: enabling device (0000 -> 0002)
kernel: vfio-pci 0000:0f:00.3: enabling device (0000 -> 0002)
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: version 1.1
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing pre-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x2880
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing post-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: reset result = 0
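
To compare reset attempts in a log like the one above, the messages can be grouped per attempt. A rough Python sketch, assuming each attempt starts with a `version` line as it does in this log; `summarize_resets` is a hypothetical helper, not part of vendor-reset:

```python
import re

# Matches lines such as:
#   kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: reset result = 0
LINE_RE = re.compile(
    r"vfio-pci (?P<dev>[0-9a-f:.]+): (?P<chip>AMD_\w+): (?P<msg>.*)"
)

def summarize_resets(log_text):
    """Return one dict per reset attempt found in log_text."""
    resets = []
    current = None
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not a vendor-reset message (e.g. vfio_ecap_init)
        msg = m.group("msg")
        if msg.startswith("version"):
            # Assumption: a "version" line marks the start of an attempt.
            current = {"device": m.group("dev"), "baco": False, "result": None}
            resets.append(current)
        elif current is None:
            continue  # message seen before any "version" line
        elif "BACO reset" in msg:
            current["baco"] = True
        elif msg.startswith("reset result ="):
            current["result"] = int(msg.split("=")[1])
    return resets
```

Run on the log above, this gives two attempts, both with result 0, and only the first one logging a BACO reset. Whether that difference matters is not established here.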

Maybe the reset routine for Polaris is just incomplete?

drujd commented 2 years ago

OK, the issue stops manifesting when I DISABLE 'Above 4G decoding' in BIOS. Weird, some people with AMD cards reported that passthrough works for them only with it enabled... (And yes, I know Resizable BAR is not supported; that has always been off.)

cppmonkey commented 2 years ago

Good to know!

Have an Asrock X570D4U (Ryzen 5700G) running Proxmox 7.1 (Kernel 5.13) and passing through 2x Radeon RX460 (same chipset as your RX550). Don't seem to have a reset issue. But when passing a card through to a guest using DP, it would reset the host upon the DE loading. After moving to HDMI, the issue went away. But I had to disable "Power Saving - Black Screen" or the guest would freeze. Not sure if Above 4G decoding is enabled, I'll have to check.

My desktop (Ryzen 9 3950X, Radeon RX 5600XT) seems to have a similar issue. DP results in the system randomly not waking up the screen. Have to log in remotely to reboot the system. Using HDMI works fine, except that the screen doesn't go to sleep.

Curious whether your system is Intel or AMD powered?

drujd commented 2 years ago

Asus X570-E Ryzen 5950X Vega 64 & RX550 (640SP) 4GiB

cppmonkey commented 2 years ago

Turns out Above 4G decoding was enabled on the X570D4U. Started running a guest and passed both RX460s through. It worked fine for 30 mins and then GPU0 crashed, locking up the system.
Halt and restart: 10 mins stable.
Halt and restart: 5 mins stable.

Given they take power from the PCIe interface, wonder if there is a power/heat issue. But GPU0 didn't feel especially hot

Been stable with a single card; the only issue is that the screens won't go to sleep. They go off and instantly wake up. It's interesting that you need to use this vendor-reset project, whilst I haven't needed to. However, I am running a Linux guest and not a Windows one.

Will give a live FC35 drive a go. See if the instability remains with 2x RX460's (and the AT2500, (Cezanne) Vega 8) GPUs

drujd commented 2 years ago

I don't think you have to use vendor-reset for Polaris cards as long as they shut down gracefully, but this project should allow them to recover from bad states caused by VM crashes, bad implementations of the shutdown procedure (in macOS IIRC) etc.

Honestly, neither of your issues seems connected to the reset bug.

bitshiftnetau commented 2 years ago

Not sure whether I'm facing exactly the same issue, but the symptoms are certainly the same. Windows 10 VM, Navi 23 RX6600 (currently not fully supported by this module afaik). Random shutdowns, and then Proxmox requires a full system reboot.