ipaqmaster / vfio

A script for easy pci and usb passthrough along with disks, iso's and other useful flags for quick tinkering with less of a headache. I use it for VM gaming and other PCI/LiveCD/PXE/VM/RawImage testing given the script's accessibility.
GNU General Public License v3.0

AMD Single GPU Passthrough hangs up and never finishes #19

Closed shekhars-li closed 6 months ago

shekhars-li commented 9 months ago

I have an AMD CPU and a single GPU (AMD 5600xt). I already installed Win10 and verified it runs fine without PCI passthrough. I then ran the following:

sudo ./main -mem 8G -image /var/lib/libvirt/images/win10.qcow2 -imageformat qcow2 -bridge tap0,enp34s0 -pci 'Radeon|USB|HDMI Audio' -ignorevtcon -run -bios /usr/share/OVMF/OVMF_CODE_4M.fd -vbios /usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom -killx

-ignoreVtconn   specified, efi-framebuffer/vtcon bindings will be left as is. 
                AMD cards don't mind vtcons; this argument is to workaround a recent 
                NULL pointer dereference bug in fbcon.c) on NVIDIA-powered hosts 
                Follow the bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=216475 
-pinvcpus       not specified, Guest will get half host's core total as vcpus (No pinning):      3 hyperthreaded vcpu's (6/2) for a total of 6 vcpu threads (12/2).  
-memory         specified, guest will receive:                  8192 MB 
-image(s)       specified, using virtual disk(s) this run: 
                Driver: virtio-blk-pci 
                1 
                  Path:         /var/lib/libvirt/images/win10.qcow2 
                  Format:       qcow2           
-romfile        specified, if a GPU is detected in the -pci arguments this romfile will be used. 
                /usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom 
                Please confirm your romfile is safe with a project such as rom-parser before using this feature 
Host int not specified, will attach VM tap to existing bridge 
enp34s0 exists and is up, will attach tap0 to that. 
ioctl(TUNSETIFF): Device or resource busy
RTNETLINK answers: Operation not supported
------------------ 
Bridge details: 
        enp34s0: 
Bridge already existed, not running dhclient -r on it. 
------------------
-bridge         specified, VM will be bridged to the host with a tap adapter. 
PCI: 
  vfio-pci isn't loaded. Loading it now. 
  Matched:      03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01) 
  IOMMU Group:  17 
    [INFO] Detected driver xhci_hcd is using this device. It will be re-bound on VM exit. 
    Adding ID and binding to:   vfio-pci 
  Matched:      28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca) 
  IOMMU Group:  20 
    [INFO] Detected driver amdgpu is using this device. It will be re-bound on VM exit. 
    Unbinding GPU from: amdgpu... 
    It appears Xorg has latched onto this GPU, cannot unbind from driver and give to guest without killing Xorg. 
    Stopping display-manager and unbinding console drivers... 
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:02 /sbin/init splash
    959 ?        Ss     0:00 /lib/systemd/systemd-logind
  19816 tty2     Sl+    0:01 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
  19939 ?        Ssl    0:03 /usr/bin/gnome-shell 
./main: line 477: 25000 Done                    echo "$fullBuspath"
     25001 Killed                  | sudo timeout --signal 9 5 tee /sys/bus/pci/devices/$fullBuspath/driver/unbind > /dev/null 2>&1
    Failed... Trying again with X killed... 
    This GPU is free. 
    Adding ID and binding to:   vfio-pci 
./main: line 479: 25047 Done                    echo "0x$vendor 0x$class"
     25048 Killed                  | sudo timeout --signal 9 5 tee /sys/bus/pci/drivers/vfio-pci/new_id > /dev/null 2>&1
    The device  0000:28:00.0 // 1002:731f  Was unable to bind via new_id after 5 seconds, is something else using it? 
    (E.g This will happen to a GPU in use by X) 
    Giving up. 

Cleaning up.. 
We only used tap0 on an existing bridge this run, removing tap0. 
tap0 removed. 
PCI: 
  vfio-pci isn't loaded. Loading it now. 
  Matched:      03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01) 
  IOMMU Group:  17 
    Rebinding 1022:43d5 back to driver: xhci_hcd 
    Successfully rebound. 
  Matched:      28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca) 
  IOMMU Group:  20 
    Rebinding 1002:731f back to driver: amdgpu 

This never returns. I checked sudo lsof | grep amdgpu. This is the output:

amdgpu_dm   338                             root  cwd       DIR              259,2      4096          2 /
amdgpu_dm   338                             root  rtd       DIR              259,2      4096          2 /
amdgpu_dm   338                             root  txt   unknown                                         /proc/338/exe
amdgpu_dm   339                             root  cwd       DIR              259,2      4096          2 /
amdgpu_dm   339                             root  rtd       DIR              259,2      4096          2 /
amdgpu_dm   339                             root  txt   unknown                                         /proc/339/exe
amdgpu_dm   340                             root  cwd       DIR              259,2      4096          2 /
amdgpu_dm   340                             root  rtd       DIR              259,2      4096          2 /
amdgpu_dm   340                             root  txt   unknown                                         /proc/340/exe
amdgpu_dm   341                             root  cwd       DIR              259,2      4096          2 /
amdgpu_dm   341                             root  rtd       DIR              259,2      4096          2 /
amdgpu_dm   341                             root  txt   unknown                                         /proc/341/exe
tee       25134                             root    3w      REG               0,22      4096      36209 /sys/bus/pci/drivers/amdgpu/bind

lspci -k during this returns:

28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
        Subsystem: Sapphire Technology Limited Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
        Kernel modules: amdgpu
28:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

I also manually tried modprobe -r amdgpu after killing gdm. This never works either. What am I doing wrong?

ipaqmaster commented 9 months ago

Hey there

In the middle of the script run it looks like it failed to detach the graphics card from the amdgpu driver from the very beginning, likely due to X. It then tried to kill X, as permitted with -killx, thinking that might do the trick, but after the systemctl stop display-manager command was issued, L:464 was still able to see Xorg, which may still have been the cause. Regardless, something is preventing you from freeing up the graphics card by continuing to use it.

Unfortunately I don't have any AMD graphics cards to test this with, but the moment I end up with one I'll make sure the script knows everything about them for unbinding purposes. The hang at the end, while not intended, won't influence your actual problem. If anything, I can look into prefixing additional timeouts to those cleanup rebinding attempts for this scenario so you can at least get your shell back.

I'm not sure if AMD cards list themselves under /dev/dri, but if you can make this happen again you could try checking sudo lsof /dev/dri/* to make sure nothing pops up. If anything does pop up when running that command, you've found your culprit.

You should also run systemctl status display-manager to check whether that is even a real service on your machine. If it is not, your X server will have to be killed a different way.

Please let me know how you go with the above two command checks.
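The /dev/dri check above can be narrowed to the one card being passed through. This is a minimal sketch, assuming the kernel exposes the card's DRM nodes under the device's drm directory in sysfs; the function names and the optional root-path arguments are made up for illustration and are not part of the script.

```shell
#!/bin/sh
# List the /dev/dri nodes that belong to one PCI device.
# sysfs_root and dev_root are parameters only so the function can be tried
# against a fake tree; on a real host the defaults apply.
drm_nodes_for_pci() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    dev_root=${3:-/dev}
    for entry in "$sysfs_root/bus/pci/devices/$pci_addr/drm"/*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || continue
        printf '%s\n' "$dev_root/dri/$(basename "$entry")"
    done
}

# Show which processes hold those nodes open (root is needed to see all users).
drm_holders() {
    nodes=$(drm_nodes_for_pci "$1")
    [ -n "$nodes" ] || { echo "no DRM nodes found for $1" >&2; return 1; }
    # Word splitting of $nodes is intentional: one argument per node.
    sudo lsof $nodes
}
```

Over SSH, drm_holders 0000:28:00.0 would then list only the processes attached to that card. An empty result after stopping the display manager suggests nothing in userspace is holding it.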

shekhars-li commented 9 months ago

Hi @ipaqmaster Thanks a lot for responding and thanks for creating this script! Here's a response from sudo lsof /dev/dri/*

COMMAND    PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
systemd      1     root  141u   CHR   226,0      0t0  408 /dev/dri/card0
systemd-l  880     root   54u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars  mem    CHR   226,0           408 /dev/dri/card0
Xorg      2262 shekhars   12u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   13u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   14u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   15u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   16u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   17u   CHR   226,0      0t0  408 /dev/dri/card0
Xorg      2262 shekhars   18u   CHR   226,0      0t0  408 /dev/dri/card0
gnome-she 2410 shekhars  mem    CHR 226,128           407 /dev/dri/renderD128
gnome-she 2410 shekhars   11u   CHR 226,128      0t0  407 /dev/dri/renderD128
gnome-she 2410 shekhars   12u   CHR 226,128      0t0  407 /dev/dri/renderD128
gnome-she 2410 shekhars   13u   CHR 226,128      0t0  407 /dev/dri/renderD128
gnome-she 2410 shekhars   14u   CHR 226,128      0t0  407 /dev/dri/renderD128

I have tried everything, and at one point (randomly) it worked when I was using a start/revert script. I changed some params and it hasn't worked since. I like your approach a lot and it makes sense to me, so I'm trying to make this work. Anyway, I always try to start this script after killing the display manager and checking sudo lsof | grep amdgpu to see if amdgpu is being used somewhere. That doesn't seem to be the case. Any other ideas I can try? Thanks!

shekhars-li commented 9 months ago

I killed gdm (I am logged in via ssh).

(base) shekhars@shekhars-desktop:~$ sudo systemctl stop display-manager

(base) shekhars@shekhars-desktop:~$ systemctl status display-manager
● gdm.service - GNOME Display Manager
     Loaded: loaded (/lib/systemd/system/gdm.service; static; vendor preset: enabled)
     Active: inactive (dead) since Sun 2024-01-14 23:10:12 PST; 36min ago
    Process: 8712 ExecStartPre=/usr/share/gdm/generate-config (code=exited, status=0/SUCCESS)
    Process: 8714 ExecStartPre=/usr/lib/gdm3/gdm-wait-for-drm (code=exited, status=0/SUCCESS)
    Process: 8715 ExecStart=/usr/sbin/gdm3 (code=exited, status=0/SUCCESS)
   Main PID: 8715 (code=exited, status=0/SUCCESS)

Jan 14 23:10:03 shekhars-desktop systemd[1]: Starting GNOME Display Manager...
Jan 14 23:10:03 shekhars-desktop systemd[1]: Started GNOME Display Manager.
Jan 14 23:10:03 shekhars-desktop gdm-launch-environment][8719]: pam_unix(gdm-launch-environment:session): session opened for user gdm by (uid=0)
Jan 14 23:10:12 shekhars-desktop systemd[1]: Stopping GNOME Display Manager...
Jan 14 23:10:12 shekhars-desktop systemd[1]: gdm.service: Succeeded.
Jan 14 23:10:12 shekhars-desktop systemd[1]: Stopped GNOME Display Manager.

(base) shekhars@shekhars-desktop:~$ sudo lsof | grep amdgpu
amdgpu_dm  337                            root  cwd       DIR              259,2      4096          2 /
amdgpu_dm  337                            root  rtd       DIR              259,2      4096          2 /
amdgpu_dm  337                            root  txt   unknown                                         /proc/337/exe
amdgpu_dm  338                            root  cwd       DIR              259,2      4096          2 /
amdgpu_dm  338                            root  rtd       DIR              259,2      4096          2 /
amdgpu_dm  338                            root  txt   unknown                                         /proc/338/exe
amdgpu_dm  339                            root  cwd       DIR              259,2      4096          2 /
amdgpu_dm  339                            root  rtd       DIR              259,2      4096          2 /
amdgpu_dm  339                            root  txt   unknown                                         /proc/339/exe
amdgpu_dm  340                            root  cwd       DIR              259,2      4096          2 /
amdgpu_dm  340                            root  rtd       DIR              259,2      4096          2 /
amdgpu_dm  340                            root  txt   unknown                                         /proc/340/exe

(base) shekhars@shekhars-desktop:~$ sudo lsof /dev/dri/*
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/125/gvfs
      Output information may be incomplete.

(base) shekhars@shekhars-desktop:~$

Seems to me amdgpu should be free to be unloaded?

shekhars-li commented 9 months ago

One final thing to add here: I followed your instructions on reddit (where I found this repo) to just try to unbind and rebind the GPU. The unbind works but the bind to vfio-pci gets stuck:

echo 0000:28:00.0 > /sys/bus/pci/drivers/amdgpu/unbind  ---> done
echo 1002 731f > /sys/bus/pci/drivers/vfio-pci/new_id ---> stuck, does not return

I do not see anything noteworthy on dmesg or syslog.
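When writing the vendor:device pair to new_id hangs, one alternative worth trying is the kernel's per-device driver_override attribute, which targets a single address rather than every device sharing that ID. The following is a minimal sketch and not what the script does; the function name and the optional sysfs-root argument are made up for illustration. It may hang in the same way if the real cause is the card failing to reset.

```shell
#!/bin/sh
# Bind one PCI device to vfio-pci through driver_override instead of new_id.
# Must run as root on a real host. The optional second argument (sysfs root)
# exists only to allow dry runs against a fake directory tree.
bind_vfio_override() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    dev_dir="$sysfs_root/bus/pci/devices/$pci_addr"

    [ -d "$dev_dir" ] || { echo "no such device: $pci_addr" >&2; return 1; }

    # Detach from the current driver, if one is bound.
    if [ -e "$dev_dir/driver/unbind" ]; then
        printf '%s\n' "$pci_addr" > "$dev_dir/driver/unbind" || return 1
    fi

    # Restrict this one device to vfio-pci; other devices are untouched.
    printf '%s\n' vfio-pci > "$dev_dir/driver_override" || return 1

    # Ask the PCI core to probe the device again so vfio-pci can claim it.
    printf '%s\n' "$pci_addr" > "$sysfs_root/bus/pci/drivers_probe"
}
```

On the host this would be called as bind_vfio_override 0000:28:00.0 from a root shell, with vfio-pci already loaded. Writing an empty line to driver_override afterwards clears the restriction so amdgpu can claim the card again.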

ipaqmaster commented 8 months ago

Hmm that's interesting how the unbind works just fine but trying to add its id to vfio-pci (and it quietly binding itself to the card) is hanging. Your lsof output does seem to show the GPU is not in use (Or at least nothing in that directory is being used, which may or may not contain your gpu. May want to ls /dev/dri/* just to be certain the check isn't empty)

I'm not sure what that sudo lsof | grep amdgpu command is supposed to be showing but it's kind of implying there are processes still interacting with amdgpu_dm

Some more questions sorry

  1. Does anything appear in dmesg after letting the command hang for a few minutes for kernel calls to start timing out?
  2. Does your AMD gpu there on 0000:28 have any other subdevices which may need to also be unbound? lspci -D |grep 0000:28: should reveal any other sub-components of the graphics pci device.
  3. What distro version and kernel version are you running there?
  4. What consumer motherboard or full server hardware product are you running this on?
  5. It seems you specified -ignoreVtconn in the script run. It's possible the amdgpu driver is modesetting and the efi framebuffer could be holding on to the card. Could you try unbinding and rebinding the card again but after echo 0 > /sys/class/vtconsole/vtcon0/bind ; echo 0 > /sys/class/vtconsole/vtcon1/bind ; echo "vesa-framebuffer.0" > /sys/bus/platform/drivers/vesa-framebuffer/unbind ? This will stop the virtual consoles from drawing so you may need to SSH in from another machine to run these first.
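The console and framebuffer unbinding in question 5 can be written so that it does not depend on how many vtcons exist or which framebuffer driver the host uses. A minimal sketch, assuming only the two driver names mentioned in this thread (efi-framebuffer and vesa-framebuffer); the function names and the optional sysfs-root argument are hypothetical.

```shell
#!/bin/sh
# Detach every virtual console listed under vtconsole. Run as root.
# sysfs_root is an illustrative parameter for dry runs.
unbind_vtcons() {
    sysfs_root=${1:-/sys}
    for vtcon in "$sysfs_root"/class/vtconsole/vtcon*; do
        [ -e "$vtcon/bind" ] || continue
        # Print the console's description before detaching it, when readable.
        if [ -r "$vtcon/name" ]; then
            printf '%s: %s\n' "$(basename "$vtcon")" "$(cat "$vtcon/name")"
        fi
        printf '0\n' > "$vtcon/bind"
    done
}

# Unbind whichever of the two framebuffer drivers currently has device .0 bound.
unbind_framebuffers() {
    sysfs_root=${1:-/sys}
    for drv in efi-framebuffer vesa-framebuffer; do
        drv_dir="$sysfs_root/bus/platform/drivers/$drv"
        # Skip drivers that are absent or have no device bound.
        [ -e "$drv_dir/$drv.0" ] || continue
        printf '%s\n' "$drv.0" > "$drv_dir/unbind"
    done
}
```

Both functions blank the local display, so they are best run over SSH, followed by the unbind and rebind attempt.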

If I can find a cheap one online I'll consider buying a second hand amd gpu to test your single gpu passthrough scenario and to try and reproduce the problem.

shekhars-li commented 8 months ago

@ipaqmaster I partly solved it. Your script seems to be fine. It just might be my environment. I threw the kitchen sink at it.

What works:

What doesn't work:

Here's cleanup part of the run:

Cleaning up.. 
We only used tap0 on an existing bridge this run, removing tap0. 
tap0 removed. 
PCI: 
  Matched:  28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca) 
  IOMMU Group:  20 
    Rebinding 1002:731f back to driver: amdgpu 
    Was unable to rebind it to amdgpu. 
  Matched:  28:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38] 
  IOMMU Group:  21 
    Rebinding 1002:ab38 back to driver: snd_hda_intel 
    Successfully rebound. 
Cleanup complete. 

Thank you so much for responding again and looking into this weird problem. I can at least confirm your script works perfectly if given the right conditions. Kernel version 5.15 is definitely a problem (as I read somewhere in a reddit thread as well).

shekhars-li commented 8 months ago

To your questions:

Does anything appear in dmesg after letting the command hang for a few minutes for kernel calls to start timing out? I see this (possibly) on attempt to rebind: [ 414.818151] amdgpu: probe of 0000:28:00.0 failed with error -22

Does your AMD gpu there on 0000:28 have any other subdevices which may need to also be unbound? lspci -D |grep 0000:28: should reveal any other sub-components of the graphics pci device. No. GPU is alone in a group (so is soundcard):

0000:28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
0000:28:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio

What distro version and kernel version are you running there? I was on 20.10 and am now on 22.04. The kernel was 5.15 earlier and is now 6.7.0-060700-generic.

What consumer motherboard or full server hardware product are you running this on? Consumer board - MSI Tomahawk 450.

It seems you specified -ignoreVtconn in the script run. It's possible the amdgpu driver is modesetting and the efi framebuffer could be holding on to the card. Could you try unbinding and rebinding the card again but after echo 0 > /sys/class/vtconsole/vtcon0/bind ; echo 0 > /sys/class/vtconsole/vtcon1/bind ; echo "vesa-framebuffer.0" > /sys/bus/platform/drivers/vesa-framebuffer/unbind ? This will stop the virtual consoles from drawing so you may need to SSH in from another machine to run these first. Let me try this. I just did a reset and my screen is blank again.

ipaqmaster commented 8 months ago

Yes when the usual unbind commands fail on their own it's indicative of some funky environment problem though I'm always looking for gotchas to add to the script for giving a heads up where it can. I'm glad the upgrade seems to have helped a little bit.

I can't run the script directly. It hangs up with black screen.

This could be a result of unbinding the virtual consoles and their framebuffers. SSH is the best way to debug vfio gpu stuff for a single gpu host if you need to read the output.

Windows does not see any network or graphics card. It loads basic display adapter. It may be because I need to connect to internet and download drivers?

Yes the guest needs AMD drivers to use its AMD gpu 🙂 but if it continues to give you problems after installing the drivers in the guest that can be looked into.

I'm not sure why its missing its network card. Perhaps the VirtIO drivers are not installed? You can try the script argument -avoidVirtio and possibly also -nvme (Though it seems it was able to boot already so you may not need the nvme argument)

[ 414.818151] amdgpu: probe of 0000:28:00.0 failed with error -22

I should have noticed this earlier but your AMD card is most definitely impacted by the reset bug leaving it unable to reset itself for re-initialization by a host (Or guest for that matter). It would be worth installing gnif's vendor-reset to see if that problem goes away for you.

If you already have your distro's build tools installed (And dkms + git) then this quick one-liner will fetch, compile and install it for you to try: cd ; git clone https://github.com/gnif/vendor-reset ; cd vendor-reset ; sudo dkms install .

This would definitely be a good idea for me to add and warn about in the script for when an AMD card with the reset problem is detected without vendor-reset installed.
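A check along those lines could look roughly like the following. This is only a sketch under assumptions: the module is expected to appear as vendor_reset in /proc/modules, and the reset_method attribute only exists on newer kernels (5.15 and later, as far as I know). The function names and the optional path arguments are invented for illustration.

```shell
#!/bin/sh
# Succeeds when the vendor_reset module appears in the loaded-module list.
# modules_file defaults to /proc/modules; the parameter is for dry runs only.
vendor_reset_loaded() {
    modules_file=${1:-/proc/modules}
    grep -q '^vendor_reset ' "$modules_file"
}

# Print the reset methods the kernel will try for a device (newer kernels only).
show_reset_method() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    method_file="$sysfs_root/bus/pci/devices/$pci_addr/reset_method"
    if [ -r "$method_file" ]; then
        cat "$method_file"
    else
        echo "reset_method not available for $pci_addr" >&2
        return 1
    fi
}
```

For example, vendor_reset_loaded || echo "vendor-reset is not loaded" could serve as the warning. The vendor-reset README describes selecting device_specific in reset_method on recent kernels, so it is worth reading before relying on the output of show_reset_method 0000:28:00.0.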

shekhars-li commented 8 months ago

@ipaqmaster

Thanks again for your responses and for creating this beautiful script. :)

ipaqmaster commented 8 months ago

No worries at all

Sorry, I didn't realize I hit the comment+close button with that last reply. If you still need to bounce things off me, feel free to re-open the issue.

It would be worth checking that the card appears under Device Manager in the VM, and noting any error codes it may have thrown after being passed through. It may hint at something else to tweak.

shekhars-li commented 8 months ago

@ipaqmaster No worries. The core of the issue is resolved now. I can reliably bootup and shutdown the VM with GPU handoffs and reset working just fine. The only problem that remains is that there is an error in Windows for my GPU - "windows has stopped this device (code 43)". I have tried everything and this one seems to not go away. If you have any ideas I can try, please let me know.

shekhars-li commented 8 months ago

If you have any ideas on dealing with the error windows throws for the GPU, please let me know. :) @ipaqmaster

shekhars-li commented 8 months ago

I found a couple of reddit posts (for AMD devices same series as mine) that solved their problem by passing a root PCI device like this:

-device pcie-root-port,bus=pci.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
-device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on \
-device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 \

Since the script takes care of binding and unbinding (and quite reliably), I don't want to use qemu directly. How can I go about doing this with the script?

ipaqmaster commented 8 months ago

Code 43 is an annoying one for this series of GPU. Some have fixed it with only the vendor-reset solution and also making sure it's loading early by adding it to their host's initramfs module list. Others have had luck by removing x-vga=on, which can be patched out of this script with sed -i 's/x-vga=on,//g' ./main. And other times it just suddenly works.

It may also be worth hitting that PCI rom file you've got there (/usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom) with https://github.com/awilliam/rom-parser to make sure you have truncated it appropriately if needed, and to check whether you need one at all. Typically NVIDIA cards are the ones that truncate their own PCI rom, making them initializable only once per boot, not AMD cards as far as I know.

But any GPU will throw a Code 43 if you use an unpatched or wrong bios rom of the card. It would be worth trying without specifying the -romfile
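Before reaching for rom-parser, a quick sanity check is whether the file starts with the PCI option ROM signature bytes 55 AA. The sketch below is not a replacement for rom-parser, which inspects far more of the image; the function name is invented for illustration.

```shell
#!/bin/sh
# Succeeds when the file begins with the PCI option ROM signature bytes 55 AA.
# Returns 2 when the file cannot be read.
rom_has_signature() {
    rom_file=$1
    [ -r "$rom_file" ] || return 2
    # Dump the first two bytes as hex and strip od's spacing.
    sig=$(od -An -tx1 -N2 "$rom_file" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

If rom_has_signature /usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom fails, the dump probably has extra data in front of the ROM image and needs trimming, or it is not a ROM image at all.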

Otherwise there's no harm trying with libvirt to see if the virtual PCIe multifunction root port layout does the trick. I'm not in a position to get a version of that into the script right this minute but may be able to later.
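For trying the same layout by hand, the root-port arguments from the Reddit posts can be generated for any host slot. This is a hedged sketch, not a feature of the script: the function name is invented, the bus name pci.0 is copied from the quoted lines, and on a q35 machine the root bus is normally named pcie.0, so the second argument may need changing.

```shell
#!/bin/sh
# Print QEMU -device arguments that place a GPU and its audio function behind
# one PCIe root port. host_slot is the host bus:device pair, e.g. 28:00.
# root_bus defaults to pci.0 as in the quoted example.
root_port_args() {
    host_slot=$1
    root_bus=${2:-pci.0}
    printf '%s\n' \
        "-device pcie-root-port,bus=${root_bus},addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1" \
        "-device vfio-pci,host=${host_slot}.0,bus=root.1,addr=00.0,multifunction=on" \
        "-device vfio-pci,host=${host_slot}.1,bus=root.1,addr=00.1"
}
```

For the card in this thread that would be root_port_args 28:00, with both 0000:28:00.0 and 0000:28:00.1 already bound to vfio-pci before QEMU starts.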

ipaqmaster commented 8 months ago

When it Code 43's in the guest you should also check the host's dmesg log to make sure vendor-reset did its thing
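One way to make that host-side check repeatable is to filter the kernel log for failed driver probes such as the earlier "probe of 0000:28:00.0 failed with error -22" line (-22 is EINVAL). A minimal sketch follows; the function name is invented, and it only matches that one message format.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "driver address errno" for each
# line of the form "<driver>: probe of <address> failed with error <errno>".
failed_probes() {
    sed -n 's/.*\] \([^ :]*\): probe of \([^ ]*\) failed with error \(-[0-9]*\).*/\1 \2 \3/p'
}
```

Typical use would be sudo dmesg | failed_probes after the guest shuts down. Pairing it with sudo dmesg | grep -i 'vendor.reset' shows whether the module logged anything at all, though the exact wording of its messages is not something this sketch assumes.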

ipaqmaster commented 6 months ago

Not sure there's much else I can help with here. This issue seems to be related to the local setup rather than the script. If you're still working on this and have any further updates I would be happy to keep looking into it with you as far as we can.