gnif / vendor-reset

Linux kernel vendor specific hardware reset module for sequences that are too complex/complicated to land in pci_quirks.c
GNU General Public License v2.0

What's the simplest way to remove Polaris 11 from vendor-reset (RX560 issue) #4

Open hippsabq opened 3 years ago

hippsabq commented 3 years ago

I run two graphics cards in my Proxmox setup (a 5700XT and RX560) and pass them among my Win/Mac/Linux VMs. Up until now it's been a real pain having to reset bare metal every time I reassigned the 5700XT (RX560 didn't have reset bug), so I thank you greatly for your work!

The only issue I think I'm having now is that vendor-reset is interfering with the RX560, and I can't reset VMs that use it. I'm still diagnosing, but I was wondering if there is something in the code I could snip out so Polaris 11 isn't affected.

thanks!

hippsabq commented 3 years ago

I removed my card's device ID (67ff) from /src/device-db.h and everything is working again now.

gnif commented 3 years ago

Can you please provide more details?

Most people we encounter have issues with the RX560 reset and need this module; you're lucky that you do not. It would be good to find out what the differences are.

hippsabq commented 3 years ago

Which brand/model of the RX560 do you have?

Gigabyte 4GB RX560

Which guest OS seems to cause the RX560 to fail most easily?

So, after some debugging, it seems the reset bug with the RX560 only shows up with my macOS VM. I can shut down/reset/move the card around the other Windows and Linux VMs, but as soon as it goes through my OpenCore/Big Sur VM, it becomes spoiled and throws an error code 40 in dmesg if I try to use it again in any VM.

Can you please provide the complete dmesg output (since boot) for when the RX560 fails to reset (or rather, should not reset).

I'll do a full reset and post the details below

hippsabq commented 3 years ago

Here is the pastebin with the full dmesg covering the Proxmox boot, macOS VM start, shutdown, and then the failed restart, which led to the code 40 fault and a CPU soft lockup: https://pastebin.com/zvDSGnXm

Here are the snippets:

Startup

[ 30.620078] vfio-pci 0000:04:00.0: enabling device (0400 -> 0403)
[ 30.620220] vfio-pci 0000:04:00.0: AMD_POLARIS11: version 1.1
[ 30.620220] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing pre-reset
[ 30.620297] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing reset
[ 30.620302] vfio-pci 0000:04:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x298c
[ 30.620302] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing post-reset
[ 30.640102] vfio-pci 0000:04:00.0: AMD_POLARIS11: reset result = 0

Shutdown

[ 131.092910] vfio-pci 0000:04:00.0: AMD_POLARIS11: version 1.1
[ 131.092912] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing pre-reset
[ 131.092992] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing reset
[ 131.092996] vfio-pci 0000:04:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x2995c
[ 131.092999] vfio-pci 0000:04:00.0: AMD_POLARIS11: Performing BACO reset
[ 131.264740] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing post-reset
[ 131.284699] vfio-pci 0000:04:00.0: AMD_POLARIS11: reset result = 0

Failed Restart

[ 200.692452] vfio-pci 0000:04:00.0: AMD_POLARIS11: version 1.1
[ 200.692454] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing pre-reset
[ 200.692547] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing reset
[ 200.692553] vfio-pci 0000:04:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x2ac8
[ 200.692553] vfio-pci 0000:04:00.0: AMD_POLARIS11: performing post-reset
[ 200.712364] vfio-pci 0000:04:00.0: AMD_POLARIS11: reset result = 0
[ 202.323267] device tap107i0 entered promiscuous mode
[ 202.354331] fwbr107i0: port 1(fwln107i0) entered blocking state
[ 202.354333] fwbr107i0: port 1(fwln107i0) entered disabled state
[ 202.354388] device fwln107i0 entered promiscuous mode
[ 202.354438] fwbr107i0: port 1(fwln107i0) entered blocking state
[ 202.354439] fwbr107i0: port 1(fwln107i0) entered forwarding state
[ 202.357494] vmbr0: port 2(fwpr107p0) entered blocking state
[ 202.357495] vmbr0: port 2(fwpr107p0) entered disabled state
[ 202.357543] device fwpr107p0 entered promiscuous mode
[ 202.357588] vmbr0: port 2(fwpr107p0) entered blocking state
[ 202.357588] vmbr0: port 2(fwpr107p0) entered forwarding state
[ 202.360370] fwbr107i0: port 2(tap107i0) entered blocking state
[ 202.360371] fwbr107i0: port 2(tap107i0) entered disabled state
[ 202.360447] fwbr107i0: port 2(tap107i0) entered blocking state
[ 202.360448] fwbr107i0: port 2(tap107i0) entered forwarding state
[ 202.405532] DMAR: DRHD: handling fault status reg 40
[ 202.406044] DMAR: DRHD: handling fault status reg 40
[ 202.406249] DMAR: DRHD: handling fault status reg 40
[ 202.406556] DMAR: DRHD: handling fault status reg 40
[ 202.406762] DMAR: DRHD: handling fault status reg 40
[ 202.406967] DMAR: DRHD: handling fault status reg 40
[ 202.407172] DMAR: DRHD: handling fault status reg 40
[ 202.407275] DMAR: DRHD: handling fault status reg 40
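Note that vendor-reset logs "reset result = 0" in all three snippets, including the failed restart, so the result line alone does not distinguish a good reset from a bad one; the DMAR faults that follow are the visible symptom. A minimal Python sketch for pulling both out of a saved dmesg dump is below. The function name `summarize` and the regular expressions are illustrative assumptions based only on the line formats shown in this thread, not part of vendor-reset.

```python
import re

# Matches vendor-reset result lines in the format seen above, e.g.
# [ 30.640102] vfio-pci 0000:04:00.0: AMD_POLARIS11: reset result = 0
RESET_RESULT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"(?P<family>AMD_\w+): reset result = (?P<result>-?\d+)"
)

# Matches the IOMMU fault lines seen in the failed restart.
DMAR_FAULT = re.compile(r"DMAR: DRHD: handling fault status reg [0-9a-f]+")


def summarize(dmesg_text):
    """Return (reset_events, dmar_fault_count) for a dmesg dump.

    reset_events is a list of (timestamp, pci_address, family, result).
    """
    resets = [
        (float(m["ts"]), m["bdf"], m["family"], int(m["result"]))
        for m in RESET_RESULT.finditer(dmesg_text)
    ]
    faults = len(DMAR_FAULT.findall(dmesg_text))
    return resets, faults


sample = (
    "[ 200.712364] vfio-pci 0000:04:00.0: AMD_POLARIS11: reset result = 0\n"
    "[ 202.405532] DMAR: DRHD: handling fault status reg 40\n"
    "[ 202.406044] DMAR: DRHD: handling fault status reg 40\n"
)
resets, faults = summarize(sample)
print(resets, faults)
```

A reset event with result 0 followed by a non-zero fault count matches the pattern reported here for the macOS guest.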

hippsabq commented 3 years ago

...and just to make sure I wasn't misremembering things, I disabled vendor-reset and then confirmed that the 560 did not have a problem resetting or switching among my VMs.

Here is the specific model: https://www.gigabyte.com/us/Graphics-Card/GV-RX560OC-4GD-rev-20#kf

gnif commented 3 years ago

Awesome, thanks! I purchased one of these for testing last week and it just arrived; you are not the first to report this. I will try to replicate it and get back to you.

gnif commented 3 years ago

I cannot replicate this fault; however, your device reports as 1002:67ff whereas mine is 1002:67ef. It is the same SoC, but obviously there is a difference. Can you please provide the BIOS from your GPU?

hippsabq commented 3 years ago

Great! Thanks for the help and all the work!

Here is the ROM I extracted

https://drive.google.com/file/d/1BAbUQI3d3ScFbpBGcoeeVhDMCocvFQCf/view?usp=sharing

gnif commented 3 years ago

Thanks, I will try to make time for this over the coming week.

gnif commented 3 years ago

It seems you're not also passing through the audio device (0000:04:00.1). In most instances, vfio passthrough doesn't work reliably if you don't pass through the entire device. Can you please add this to your VM and try again?

hippsabq commented 3 years ago

The 04:00.1 is combined with the 04:00.0 passthrough. I've never had any HDMI Audio issues from the rx560 with passing through just 04:00.0. I tried it anyway and got the error below:

kvm: -device vfio-pci,host=0000:04:00.1,id=hostpci3,bus=ich9-pcie-port-4,addr=0x0: vfio 0000:04:00.1: device is already attached
start failed: QEMU exited with code 1

hippsabq commented 3 years ago

If it helps at all, here is another user with the same issue using Polaris 10 (rx580) on a MacOS VM: https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/

thenickdude commented 3 years ago

That's me. Happy to provide logs but it looks the same as the other reports so far.

gnif commented 3 years ago

Can you please provide the following information:

Vendor Device ID (as obtained from lspci -nn)
Manufacturer (ie. Gigabyte)
Marketed Model Name
Motherboard
Host BIOS Mode: CSM or UEFI
Is it your boot GPU?
Host Kernel Version
Have you flashed a modded or different bios?
VBIOS Version
ATOM BIOS Version (if applicable, amdgpu and vendor-reset both will print this out in the dmesg)
Guest OS: win/linux/osx

Works Without Vendor-Reset: yes/no/partial
Graceful Reboot of Guest Works: yes/no/partial
Force Stop/Start of Guest Works: yes/no/partial

Ansa89 commented 3 years ago

Since gnif/vendor-reset#7 is a duplicate:

About VBIOS/ATOM BIOS: I can try to read it from the hypervisor, however I would appreciate a link (or a guide) to the right software to do it.

thenickdude commented 3 years ago

You can read vBIOS version from a Linux hypervisor like so (just substitute your correct PCIe address):

echo 1 > /sys/bus/pci/devices/0000:03:00.0/rom
cat      /sys/bus/pci/devices/0000:03:00.0/rom > vbios
echo 0 > /sys/bus/pci/devices/0000:03:00.0/rom

strings vbios | head -n 20

For example:

 761295520
08/29/19 02:18
113-1E3870U-O4Q
POLARIS20
PCI_EXPRESS
GDDR5
E387 Polaris20 XTX A1 GDDR5 256Mx32 8GB
(C) 1988-2010, Advanced Micro Devices, Inc.
ATOMBIOSBK-AMD VER015.050.002.001.000000

Version here is "015.050.002.001.000000"
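For anyone who would rather not eyeball the strings output, the same lookup can be sketched in Python. This is a minimal sketch that assumes the dump contains the "ATOMBIOSBK-AMD VER" marker exactly as in the example above; other ROMs may use a different marker, in which case it returns None. The function name `atom_bios_version` is hypothetical.

```python
import re

# Assumption: the version follows the literal marker seen in the example output.
ATOM_VERSION = re.compile(rb"ATOMBIOSBK-AMD VER([0-9.]+)")


def atom_bios_version(rom_bytes):
    """Return the ATOM BIOS version found in a vBIOS dump, or None if absent."""
    match = ATOM_VERSION.search(rom_bytes)
    return match.group(1).decode("ascii") if match else None


# In-memory example using the marker string from the output above.
example = b"\x00\x00ATOMBIOSBK-AMD VER015.050.002.001.000000\x00"
print(atom_bios_version(example))
```

To run it against the file produced by the commands above, read it in binary mode, for example `atom_bios_version(open("vbios", "rb").read())`.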

Ansa89 commented 3 years ago

Thank you very much for the mini how-to. However, I get an I/O error when I try to actually read the vbios (the cat /sys/...../rom > vbios part). Could it be related to the fact that I'm using the pciback kernel module as the driver for that device?

hippsabq commented 3 years ago

Vendor Device ID:

04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Polaris11] [1002:67ff] (rev cf)
04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]

Manufacturer: Gigabyte
Marketed Model Name: Gigabyte RX560 OC 4G (rev 2)
Motherboard: HP Proprietary (Z620 Workstation https://support.hp.com/us-en/document/c03270936)
Host BIOS Mode: UEFI
Is it your boot GPU? No
Host Kernel Version: 5.4.73-1-pve (Proxmox)
Have you flashed a modded or different bios? No
VBIOS Version:
ATOM BIOS Version: ATOMBIOSBK-AMD VER015.050.002.001.000000
Guest OS:
No Issues: Windows 10, PoP OS
Issues: OpenCore MacOS BigSur
Works Without Vendor-Reset: yes, with all OSs
Graceful Reboot of Guest Works: partial (works with Win/Linux, but not MacOS)
Force Stop/Start of Guest Works: partial (works with Win/Linux, but not MacOS)
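Since the 1002:67ff versus 1002:67ef difference is what separates this card from the one that could not reproduce the fault, it can help to pull the IDs out of lspci -nn output mechanically when comparing reports. The following is a minimal Python sketch; the function name `pci_ids` is hypothetical, and it assumes the vendor:device pair is the last bracketed `xxxx:xxxx` token on the line, as in the output above.

```python
import re

# A bracketed pair of 4-digit hex values, e.g. [1002:67ff].
# Class codes such as [0300] have no colon and are not matched.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_line):
    """Return (vendor_id, device_id) from one `lspci -nn` line, or None."""
    matches = PCI_ID.findall(lspci_line)
    return matches[-1] if matches else None


line = (
    "04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. "
    "[AMD/ATI] Baffin [Polaris11] [1002:67ff] (rev cf)"
)
print(pci_ids(line))
```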

thenickdude commented 3 years ago

Vendor Device ID: 1002:67df (GPU) 1002:aaf0 (Audio)
Manufacturer: Sapphire
Marketed Model Name: Sapphire Pulse Radeon RX 580 8GB GDDR5 (11265-05-20G)
Motherboard: Asrock EP2C602
Host BIOS Mode: UEFI
Is it your boot GPU? No
Host Kernel Version: 5.4.78-1-pve (Proxmox)
Have you flashed a modded or different bios? No
VBIOS Version 015.050.002.001.000000
Guest OS: macOS Catalina, macOS Big Sur, Ubuntu, Windows 10

Works Without Vendor-Reset: Partial - sometimes it can be reused for subsequent boots. Vendor reset does not improve the situation.

Graceful Reboot of Guest Works: Yes for Windows 10 guests, only sometimes for macOS guests
Force Stop/Start of Guest Works: No

When my guest OS tries to bring up the GPU using its AMD drivers (Windows 10 or macOS Big Sur), I get DMAR: DRHD: handling fault status reg 40 followed by reported host kernel thread lockups that kill the host.

Before adding vendor-reset I would see faults like this reported from the PCIe root port the card was attached to instead, on the second guest boot:

pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
pcieport 0000:00:02.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:00:02.0: AER: device [8086:3c04] error status/mask=00004000/00000000
pcieport 0000:00:02.0: AER: [14] CmpltTO (First)
pcieport 0000:00:02.0: AER: Device recovery successful

I no longer see these appearing; I get that DMAR error instead.

thenickdude commented 3 years ago

Here's a log I captured earlier of what the detected host lockup looks like on my system: https://pastebin.com/raw/aeWxpGE0

I see drm_gem_vram_object_vunmap on the trace which seems interesting? EDIT: Nope, I tried modprobe --remove drm_vram_helper drm_kms_helper drm ttm and the situation did not improve, guess it was just a coincidence that it was one of the threads affected by the locked-up CPU.

Ansa89 commented 3 years ago

I tried to pass the GPU to a Debian 10.7 VM (BIOS and UEFI lead to the same result). This is the related dmesg output:

[drm] amdgpu kernel modesetting enabled.
[drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[drm] register mmio base: 0xF1A00000
[drm] register mmio size: 262144
[drm] add ip block number 0 <vi_common>
[drm] add ip block number 1 <gmc_v8_0>
[drm] add ip block number 2 <tonga_ih>
[drm] add ip block number 3 <powerplay>
[drm] add ip block number 4 <dm>
[drm] add ip block number 5 <gfx_v8_0>
[drm] add ip block number 6 <sdma_v3_0>
[drm] add ip block number 7 <uvd_v6_0>
[drm] add ip block number 8 <vce_v3_0>
[drm] BIOS signature incorrect ff ff
amdgpu 0000:00:06.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xd65c
amdgpu 0000:00:06.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xd65c
[drm:amdgpu_get_bios [amdgpu]] *ERROR* Unable to locate a BIOS ROM
amdgpu 0000:00:06.0: Fatal error during GPU init
[drm] amdgpu: finishing device.
amdgpu: probe of 0000:00:06.0 failed with error -22

And this is the host dmesg output:

pciback 0000:01:00.0: AMD_POLARIS10: version 1.1
pciback 0000:01:00.0: AMD_POLARIS10: performing pre-reset
pciback 0000:01:00.0: AMD_POLARIS10: performing reset
pciback 0000:01:00.0: AMD_POLARIS10: CLOCK_CNTL: 0x0, PC: 0x2b58
pciback 0000:01:00.0: AMD_POLARIS10: performing post-reset
pciback 0000:01:00.0: AMD_POLARIS10: reset result = 0

Ansa89 commented 3 years ago

Is there any news about this?

gnif commented 3 years ago

@Ansa89 it seems your issue is unrelated; you simply need to specify a romfile for your GPU. This is an issue that randomly affects both NVIDIA and AMD GPUs and seems to be boot/BIOS related, not GPU related.
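The guest log above complains that it expected the ROM header signature 0xaa55 and got 0xd65c. A PCI expansion ROM image begins with the bytes 55 AA, which read as 0xaa55 in little-endian order, so a dumped file can be sanity-checked before it is used as a romfile. A minimal Python sketch follows; the function name `has_pci_rom_signature` is hypothetical, and passing this check only confirms the first two bytes, not that the image is otherwise valid.

```python
def has_pci_rom_signature(rom_bytes):
    """True if the dump starts with the PCI expansion ROM signature 0xAA55."""
    # The signature is stored little-endian: byte 0 is 0x55, byte 1 is 0xAA.
    return len(rom_bytes) >= 2 and int.from_bytes(rom_bytes[:2], "little") == 0xAA55


# The first case mimics a good header; the second mimics the 0xd65c value
# reported in the guest log above.
print(has_pci_rom_signature(b"\x55\xaa\x00\x00"))
print(has_pci_rom_signature(b"\x5c\xd6\x00\x00"))
```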

Ansa89 commented 3 years ago

Ok, I will try. However, please note that without vendor-reset the GPU vBIOS is read correctly by the VM (both BIOS and UEFI). For this reason, IMHO, it looks related to how vendor-reset resets the card.

thenickdude commented 3 years ago

A commenter on my blog noted that they were able to dramatically reduce the hang rate of their card's reset by isolating the guest VM cores from the host:

"Linux53: I had the same issue with the RX 580, by isolating the cores that macOS guest is using, from the host, this issue “almost” disappear. (1 in 20 guest reboots the host gets crashed)"

Maybe this explains why my RX 580 reset fails so frequently for me on a second-start of my macOS guest (<15% success rate), since that VM is assigned all 32 of my cores, but usually works fine (75%+) when performing a second-start of my Linux and Windows guests, which are only assigned half as many cores.