Open drujd opened 2 years ago
OK, the issue stops manifesting when I DISABLE 'Above 4G decoding' in BIOS. Weird, some people with AMD cards reported that passthrough works for them only with it enabled... (And yes, I know resizeable BAR is not supported, that has always been off)
Good to know!
Have an Asrock X570D4U (Ryzen 5700G) running Proxmox 7.1 (Kernel 5.13) and passing through 2x Radeon RX460 (Same chipset as your RX550). Don't seem to have a reset issue. But when passing a card through to a guest using DP, it would reset the host as soon as the DE loaded. After moving to HDMI, the issue went away. But I had to disable "Power Saving - Black Screen" or the guest would freeze. Not sure if Above 4G decoding is enabled - I'll have to check
My desktop (Ryzen 9 3950X, Radeon RX 5600XT) seems to have a similar issue. With DP, the system randomly fails to wake up the screen, and I have to log in remotely to reboot it. Using HDMI works fine, except that the screen doesn't go to sleep.
Curious whether your system is Intel or AMD powered?
Asus X570-E Ryzen 5950X Vega 64 & RX550 (640SP) 4GiB
Turns out Above 4G decoding was enabled on the X570D4U. Started running a guest and passed both RX460s through. Worked fine for 30 mins and then GPU0 crashed, locking up the system. Halt and restart: 10 mins stable. Halt and restart: 5 mins stable.
Given they take power from the PCIe interface, wonder if there is a power/heat issue. But GPU0 didn't feel especially hot
Been stable with a single card; the only issue is that the screens won't go to sleep. They go off and instantly wake up. It's interesting that you need to use this vendor-reset project, whilst I haven't needed to. However, I am running a Linux guest and not a Windows one.
Will give a live FC35 drive a go. See if the instability remains with 2x RX460s (and the AT2500 and (Cezanne) Vega 8 GPUs)
I don't think you have to use vendor-reset for Polaris cards as long as they shut down gracefully, but this project should allow them to recover from bad states caused by VM crashes, bad implementations of the shutdown procedure (in macOS, IIRC), etc.
Honestly, neither of your issues seems connected to the reset bug.
Not sure if I'm facing the same issue exactly, but certainly the same symptoms as I'm sure you are facing. Windows 10 VM, Navi 23 RX6600 (currently not fully supported by this module afaik). Random shutdowns and then Proxmox requires a full system reboot.
I have a 640SP version of RX550 (Polaris11-based) and it seems that something is missing from its reset routine to work correctly in a guest Windows 11 VM after using it with Linux/amdgpu driver before that (regardless of whether that happens on a host or in a guest Linux VM).
If the GPU is never bound to amdgpu (vfio-pci.ids=1002:67ff,1002:aae0 kernel param), it works perfectly. I can reboot, reset, shutdown & start the VM again and all is fine (but I think that was the case even without this module).
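For anyone wanting to verify the "never bound to amdgpu" condition, here is a minimal Python sketch. It assumes the usual sysfs layout, where each PCI device directory has a `driver` symlink pointing at the bound driver. The function names and the `sysfs_root` parameter are made up for illustration and are not part of vendor-reset or any other tool.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    Assumes <sysfs_root>/<pci_addr>/driver is a symlink to the driver
    directory (e.g. .../drivers/vfio-pci) when a driver is bound.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver currently bound
    return os.path.basename(os.readlink(link))


def vfio_ids_from_cmdline(cmdline):
    """Return the vendor:device IDs from a vfio-pci.ids= kernel parameter."""
    for token in cmdline.split():
        if token.startswith("vfio-pci.ids="):
            return token.split("=", 1)[1].split(",")
    return []  # parameter not present
```

On the host, something like `bound_driver("0000:05:00.0")` should report `vfio-pci` rather than `amdgpu` if the kernel parameter took effect, and `vfio_ids_from_cmdline(open("/proc/cmdline").read())` shows which IDs were actually passed at boot.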
However, once I actually use the GPU in Linux (whether in a host or guest system doesn't matter), it is 'doomed' for Windows usage until a (host) reboot. The VM actually seems to work at first and boots to Windows, but after a while on the desktop (or immediately if I e.g. try to start Edge), the driver (21.12.1) crashes, the screen blinks many times and after a while, Windows falls back to the basic driver. Reboot / hard reset / shutdown of the VM doesn't help, only a reboot of the whole system does.
I am running Arch 5.15.12-arch1. I am aware of #46 and have 'w /sys/bus/pci/devices/0000:05:00.0/reset_method - - - - device_specific' in tmpfiles.d and the module seems to work 'correctly':
Maybe the reset routine for Polaris is just incomplete?
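To confirm the tmpfiles.d entry mentioned above actually took effect, a small Python sketch like this can read back the device's `reset_method` file. It assumes the file holds a space-separated, ordered list of enabled reset methods (as on recent kernels). The helper names and the `sysfs_root` parameter are hypothetical, for illustration only.

```python
import os


def reset_methods(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the ordered list of enabled reset methods, or None.

    None means the reset_method file does not exist, e.g. on a kernel
    without this attribute or for a device with no reset support.
    """
    path = os.path.join(sysfs_root, pci_addr, "reset_method")
    try:
        with open(path) as f:
            return f.read().split()
    except FileNotFoundError:
        return None


def device_specific_is_first(methods):
    """True if 'device_specific' is the first (preferred) reset method."""
    return bool(methods) and methods[0] == "device_specific"
```

For the card in this report, `device_specific_is_first(reset_methods("0000:05:00.0"))` would be expected to return `True` once the tmpfiles.d rule has been applied. That only shows which method is selected, not that the reset itself is complete.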