gnif / vendor-reset

Linux kernel vendor specific hardware reset module for sequences that are too complex/complicated to land in pci_quirks.c
GNU General Public License v2.0

Instinct MI100 cluster fails to reset on restart #80

Open TNT3530 opened 7 months ago

TNT3530 commented 7 months ago

ProxMox 7.3-3, Kernel 5.15.53-1-pve

I applied the changes here to get it functioning with this kernel, and double-checked that the reset_method value of every PCIe device is correctly set to device_specific.
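For anyone wanting to repeat that check, here is a minimal sketch of reading the reset_method attribute that kernels 5.15 and newer expose in sysfs for each PCI device. The function names are hypothetical and are not part of vendor-reset; the sketch assumes the GPUs report the AMD vendor ID 0x1002 and that sysfs is mounted at /sys.

```python
from pathlib import Path


def read_reset_methods(sysfs_root="/sys/bus/pci/devices", vendor="0x1002"):
    """Return {pci_address: reset_method contents} for devices of one vendor.

    Assumption: 0x1002 (AMD) is the vendor ID of the passed-through GPUs.
    """
    results = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return results
    for dev in sorted(root.iterdir()):
        vendor_file = dev / "vendor"
        method_file = dev / "reset_method"
        # Skip entries that lack either attribute (e.g. kernels older than 5.15).
        if not vendor_file.is_file() or not method_file.is_file():
            continue
        if vendor_file.read_text().strip().lower() != vendor.lower():
            continue
        try:
            results[dev.name] = method_file.read_text().strip()
        except OSError:
            # Treat an unreadable attribute as "no reset method reported".
            results[dev.name] = ""
    return results


def not_device_specific(methods):
    """List the addresses whose reset_method is anything other than device_specific."""
    return [addr for addr, m in methods.items() if m.split() != ["device_specific"]]


if __name__ == "__main__":
    found = read_reset_methods()
    for addr, method in found.items():
        print(f"{addr}: {method or '(none)'}")
    print("Not device_specific:", not_device_specific(found))
```

Running it on the host before the first guest boot and again after the failed restart would show whether the value changed in between.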

The first guest boot shows this (image), but all GPUs pass through fine.

Attempting to shut down and restart the guest causes this (image), ending in the guest failing to boot with "atombios stuck in loop for more than 20secs aborting" (image).

gnif commented 6 months ago

Your method of setting the reset to device specific is not supported; you are supposed to use the udev rules as provided in the project. Your service may be running too late, and the inbuilt reset may have already been used at some point during boot.

If this does not solve the problem, I am sorry but there is not much else we can do here.

TNT3530 commented 6 months ago

I have the dkms module loaded in the Proxmox host (image)

and activated in my /etc/modules (image).

With the service disabled, here is the initial boot (image).

And all GPUs pass through fine.

Upon restarting in the guest, this is what it spits out (image).

Searching with dmesg | grep reset returns nothing other than the above and a few USB devices, and dmesg | grep vfio shows no new lines, so I assume it isn't running.
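The two searches above can be combined into a single pass over the kernel log. This is a rough sketch rather than anything shipped with vendor-reset: the helper names are hypothetical, the match is case-insensitive (unlike plain grep), and reading dmesg may require root on hosts that restrict it.

```python
import subprocess


def filter_log(lines, keywords=("reset", "vfio")):
    """Keep the lines that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln.lower() for k in lowered)]


def read_dmesg():
    """Return the kernel log as a list of lines, or an empty list if unavailable."""
    try:
        out = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return []
    return out.stdout.splitlines()


if __name__ == "__main__":
    for line in filter_log(read_dmesg()):
        print(line)
```

Saving the output once after the first guest boot and once after the failed restart, then comparing the two, makes any new reset or vfio lines easier to spot.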

Moving the vendor-reset entry to the first line of /etc/modules does the same thing as above, but with the bonus of this (image).