gnif / vendor-reset

Linux kernel vendor-specific hardware reset module for sequences that are too complex to land in pci_quirks.c
GNU General Public License v2.0

eGPU Not Resetting #69

Open patchrick84 opened 1 year ago

patchrick84 commented 1 year ago

I posted this in the Proxmox forum first, thinking it was a Proxmox issue, but someone more knowledgeable than I am suggested I start here instead. Here's the post from over there:

I have a Radeon RX 5500 XT in a Sonnet eGPU Puck, connected via Thunderbolt 3. The host is an HP Elite Mini 600 G9 with an i7 and 32GB of RAM. I'm 99% sure passthrough itself is working: all the IOMMU checks line up with what I've read in the thread above and elsewhere. In fact, the GPU worked once in the Windows 10 VM, but since then I get a "Code 43" error in Windows. I also tried Ubuntu and got some display output the first time, but nothing since.

I'm pretty sure this is related to the vendor-reset situation. Sometimes, if I unplug the eGPU from the Thunderbolt cable, I can get some activity from it in the VM. I'm not sure what else to try. I'm relatively inexperienced with Linux and Proxmox, but I can usually find my way around with help from Google.

Here's the information I think is relevant to helping me out here. Please let me know if there's anything else you need to know. I'm really hoping to get this at least semi-functional. Any help that can be provided would be greatly appreciated!

I'm on Proxmox 7.3-6 with the 5.19 kernel.

/etc/default/grub (part of GRUB_CMDLINE_LINUX_DEFAULT is there for Intel iGPU passthrough, used for Plex transcoding in an LXC):

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on i915.enable_gvt=1 iommu=pt kvm.ignore_msrs=1"
GRUB_CMDLINE_LINUX=""
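Edits to /etc/default/grub only take effect after the GRUB configuration is regenerated and the host is rebooted, so it can be worth confirming that the running kernel actually received these flags. The following is a minimal sketch, assuming the host boots through GRUB; the helper name `missing_flags` is made up for illustration, and the flag list is copied from the GRUB_CMDLINE_LINUX_DEFAULT line above.

```python
from pathlib import Path

# Flags copied from the GRUB_CMDLINE_LINUX_DEFAULT line in the post.
EXPECTED = ("intel_iommu=on", "i915.enable_gvt=1", "iommu=pt", "kvm.ignore_msrs=1")


def missing_flags(cmdline, expected=EXPECTED):
    """Return the expected flags that do not appear in the given command line."""
    present = set(cmdline.split())
    return [flag for flag in expected if flag not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    proc = Path("/proc/cmdline")
    if proc.exists():
        missing = missing_flags(proc.read_text())
        print("missing flags:", missing if missing else "none")
```

If any flag is reported missing, the edited file has probably not been applied to the boot configuration yet.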

VM.conf

agent: 1
args: -acpitable file=/usr/slic/slic_table
balloon: 0
bios: ovmf
boot: order=scsi0;ide2
cores: 12
cpu: host
efidisk0: SSD2:110/vm-110-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:0a:00,pcie=1,rombar=0
ide2: none,media=cdrom
machine: pc-q35-7.0
memory: 16384
meta: creation-qemu=7.1.0,ctime=1673898377
name: Windows
net0: virtio=3E:ED:7E:AE:16:BD,bridge=vmbr0
numa: 0
ostype: win10
scsi0: SSD2:110/vm-110-disk-2.qcow2,cache=writeback,discard=on,iothread=1,size=1T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=8c850248-1dc4-4b2e-9d9b-634144de2472
sockets: 1
vga: none
vmgenid: 1738711b-0ea9-44c1-ae4e-af3dc77b1458
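The hostpci0 entry passes through every function of 0000:0a:00, and the journal output further down reports "no available reset mechanism" for 0000:0a:00.1. On kernels 5.15 and later, sysfs exposes a per-device `reset_method` attribute listing the reset methods the kernel will try. The sketch below reads it for both functions; the helper name `reset_methods` is hypothetical, and the attribute may be absent when the kernel considers the device to have no usable reset.

```python
from pathlib import Path


def reset_methods(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the reset methods listed for one PCI function.

    Returns None when the reset_method attribute cannot be read, for
    example because the device is not present, the kernel predates 5.15,
    or the attribute is hidden for this device.
    """
    attr = Path(sysfs_root) / bdf / "reset_method"
    try:
        return attr.read_text().split()
    except OSError:
        return None


if __name__ == "__main__":
    # The two functions behind hostpci0 in the VM config above.
    for bdf in ("0000:0a:00.0", "0000:0a:00.1"):
        print(bdf, reset_methods(bdf))
```

Comparing the output for the two functions shows whether the kernel lists any reset method at all for the .1 function.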

Output of journalctl -b 0 | grep reset once the VM is booted:

Feb 10 21:19:46 AmtrickPVE QEMU[36146]: kvm: vfio: Cannot reset device 0000:0a:00.1, no available reset mechanism.
Feb 10 21:19:46 AmtrickPVE QEMU[36146]: kvm: vfio: Cannot reset device 0000:0a:00.1, no available reset mechanism.
Feb 10 21:19:46 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing pre-reset
Feb 10 21:19:46 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset
Feb 10 21:19:46 AmtrickPVE kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 21:19:47 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 21:19:47 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 21:19:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 21:19:58 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: timed out waiting for PSP bootloader to respond after reset
Feb 10 21:19:58 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: failed to reset device
Feb 10 21:19:58 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing post-reset
Feb 10 21:19:58 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: reset result = 0
Feb 10 21:20:08 AmtrickPVE qmeventd[944]: read: Connection reset by peer
Feb 10 21:20:08 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing pre-reset
Feb 10 21:20:08 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset
Feb 10 21:20:08 AmtrickPVE kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 21:20:09 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 21:20:09 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 21:20:10 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: timed out waiting for PSP bootloader to respond after reset
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: failed to reset device
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing post-reset
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: reset result = 0
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing pre-reset
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset
Feb 10 21:20:21 AmtrickPVE kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 21:20:21 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 21:20:22 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 21:20:33 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: timed out waiting for PSP bootloader to respond after reset
Feb 10 21:20:33 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: failed to reset device
Feb 10 21:20:33 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing post-reset
Feb 10 21:20:33 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: reset result = 0
Feb 10 21:20:35 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing pre-reset
Feb 10 21:20:35 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset
Feb 10 21:20:35 AmtrickPVE kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 21:20:36 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 21:20:36 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 21:20:37 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: timed out waiting for PSP bootloader to respond after reset
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: failed to reset device
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing post-reset
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: reset result = 0
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing pre-reset
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset
Feb 10 21:20:48 AmtrickPVE kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 21:20:48 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 21:20:49 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 21:21:00 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: timed out waiting for PSP bootloader to respond after reset
Feb 10 21:21:00 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: failed to reset device
Feb 10 21:21:00 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing post-reset
Feb 10 21:21:00 AmtrickPVE kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: reset result = 0
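Every attempt in the log above follows the same pattern: the mode1 reset is reported as succeeded, the PSP bootloader wait times out roughly 10 to 11 seconds later, "failed to reset device" is logged, and the attempt still ends with "reset result = 0". For longer logs, a small script can group the lines into attempts and make that pattern easier to see. This is only a sketch: the function name `summarize` and the dictionary keys are made up for illustration, and the regular expression assumes the short journalctl format shown above.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
# "Feb 10 21:19:46 host kernel: vfio-pci 0000:0a:00.0: AMD_NAVI14: performing reset"
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: vfio-pci (\S+): [A-Z0-9_]+: (.*)$"
)


def parse_stamp(stamp):
    # The short journal format omits the year, so a placeholder year is
    # prepended; only differences between timestamps are used.
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


def summarize(lines):
    """Group vendor-reset log lines into one dict per reset attempt."""
    attempts = []
    current = None
    mode1_ok = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # QEMU lines, vendor-reset-drm lines, unrelated output
        stamp, device, msg = match.groups()
        when = parse_stamp(stamp)
        if msg == "performing pre-reset":
            current = {"device": device, "failed": False,
                       "result": None, "psp_wait_s": None}
            attempts.append(current)
            mode1_ok = None
        elif current is None:
            continue
        elif msg == "mode1 reset succeeded":
            mode1_ok = when
        elif msg.startswith("timed out waiting for PSP bootloader"):
            if mode1_ok is not None:
                current["psp_wait_s"] = (when - mode1_ok).total_seconds()
        elif msg == "failed to reset device":
            current["failed"] = True
        elif msg.startswith("reset result = "):
            current["result"] = int(msg.rsplit("=", 1)[1])
            current = None
    return attempts
```

To use it, save the journalctl output to a file and pass the open file (or a list of lines) to `summarize`; each returned dict shows whether the attempt logged a failure, the reported result code, and how long the PSP wait lasted.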