system76 / firmware-open

System76 Open Firmware

2023/09/08 firmware breaks nvidia drivers with RmInitAdapter failed error! #528

Open alexispurslane opened 7 months ago

alexispurslane commented 7 months ago

I recently updated from a 2022 BIOS to the latest BIOS version via firmware-manager (from the system76 COPR repo, layered via rpm-ostree). Since then, whether Secure Boot is disabled or enabled (with a properly signed OS and nvidia driver, and the signing keys enrolled via mokutil), every boot shows the following:

  1. Everything that tries to use the nvidia GPU (like nvidia-smi) reports that there is no GPU to talk to.
  2. My external screen is black, which suggests the integrated GPU is rendering my desktop.
  3. Errors like the ones below appear in dmesg and journalctl:
[ 104.658925] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[  104.659702] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  104.711305] rfkill: input handler disabled
[  111.870516] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[  111.871336] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  119.073932] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[  119.074681] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  119.230675] systemd-journald[1181]: /var/log/journal/79a57eb3ebdd4db9ba51854cd4696e54/user-1000.journal: Journal file uses a different sequence number ID, rotating.
[  126.271736] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[  126.272531] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
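
To make the repeating failures above easier to count and compare between boots, here is a minimal Python sketch that pulls the timestamp, PCI address, and status code out of each `RmInitAdapter failed!` line. The function name `find_rm_init_failures` and the regular expression are illustrative assumptions based only on the line format shown in this log, not anything provided by the nvidia driver.

```python
import re

# Matches lines shaped like the ones pasted above, e.g.
# [  104.658925] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
# The pattern is an assumption derived from this log only.
RM_INIT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU (?P<pci>[0-9a-fA-F:.]+): "
    r"RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def find_rm_init_failures(log_text):
    """Return a list of (timestamp, pci_address, code) tuples, one per failure line."""
    failures = []
    for line in log_text.splitlines():
        match = RM_INIT_RE.search(line)
        if match:
            failures.append(
                (float(match.group("ts")), match.group("pci"), match.group("code"))
            )
    return failures


# Sample input copied from the dmesg excerpt in this report.
SAMPLE = """\
[  104.658925] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[  104.659702] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  104.711305] rfkill: input handler disabled
[  111.870516] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
"""

failures = find_rm_init_failures(SAMPLE)
```

Feeding it the saved output of `dmesg` from a boot with and without the external monitor attached would show whether the same code is reported in both cases.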

I've also noticed these errors from earlier in that log:

[   57.036545] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   57.037431] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[   57.037558] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

I also saw the message below, which doesn't make sense to me: I have enrolled the keys the nvidia drivers were signed with, and I double-checked by listing the enrolled keys with mokutil and comparing them against the signer that modinfo reports for the nvidia module:

[   18.961827] nvidia: loading out-of-tree module taints kernel.
[   18.961833] nvidia: module license 'NVIDIA' taints kernel.
[   18.961833] Disabling lock debugging due to kernel taint
>>> [   18.961835] nvidia: module verification failed: signature and/or required key missing - tainting kernel <<<
[   18.961836] nvidia: module license taints kernel.
[   19.055599] Generic FE-GE Realtek PHY r8169-0-3000:00: attached PHY driver (mii_bus:phy_addr=r8169-0-3000:00, irq=MAC)
[   19.104499] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[   19.106144] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[   19.157144] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.54.14  Thu Feb 22 01:44:30 UTC 2024
[   19.213972] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
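
The manual mokutil/modinfo comparison described above can be roughly sketched in Python. This is a hedged sketch, not a definitive check: it assumes `modinfo -F signer nvidia` prints the certificate common name and that `mokutil --list-enrolled` prints each certificate's `Subject:` line in the usual OpenSSL text form. It compares by common name only, so it cannot tell two keys with the same name apart. The helper names `signer_is_enrolled` and `check_module` are hypothetical.

```python
import re
import subprocess


def signer_is_enrolled(signer, mok_listing):
    """Return True if a Subject line in the MOK listing has CN equal to `signer`.

    Accepts both "CN=Name" and "CN = Name" spellings. Comparison is by
    common name only (an assumption made for this sketch).
    """
    if not signer:
        # An empty signer means modinfo reported no signature at all.
        return False
    pattern = re.compile(
        r"Subject:.*\bCN\s*=\s*" + re.escape(signer) + r"\s*(,|$)",
        re.MULTILINE,
    )
    return bool(pattern.search(mok_listing))


def check_module(module="nvidia"):
    """Run modinfo and mokutil, then return (signer, enrolled?).

    Requires both tools to be installed; mokutil may need root privileges.
    """
    signer = subprocess.run(
        ["modinfo", "-F", "signer", module],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    listing = subprocess.run(
        ["mokutil", "--list-enrolled"],
        capture_output=True, text=True, check=True,
    ).stdout
    return signer, signer_is_enrolled(signer, listing)
```

Calling `check_module()` on the affected machine would return the signer name and whether a matching certificate is enrolled; a mismatch there would be one possible explanation for the "module verification failed" line.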

Here's the full dmesg log: https://paste.centos.org/view/cf63c092.

Steps to reproduce

  1. Update firmware with firmware manager
  2. Reboot.

Expected behavior

  1. Update firmware
  2. Reboot
  3. Find the system working as it did before.

Actual behavior

  1. Update firmware
  2. Reboot
  3. Nvidia drivers are broken: the GPU fails to initialize (see the RmInitAdapter errors above)

Additional info

In order to make absolutely sure this wasn't a problem with my install, I reset back to completely vanilla Silverblue and reinstalled silverblue-nvidia several times.

alexispurslane commented 7 months ago

I booted into a live Pop!_OS image and saw exactly the same symptoms, so I know this isn't Silverblue's fault.

alexispurslane commented 7 months ago

Update:

I was able to narrow down the source of the problem a little. The problem only occurs (whether on a Pop!_OS live image or in Fedora Silverblue) when the system is booted with an external monitor plugged in. If I do that, I get all the symptoms above.

If I boot the system (in Hybrid graphics mode) without an external monitor plugged in, and then plug it in after booting, my nvidia card (and even Wayland on nvidia!) works fine.
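
To record which display connectors the kernel sees as attached in each of the two scenarios above, a small Python sketch can read the `status` files under `/sys/class/drm`. The function name `connector_states` and its `drm_root` parameter are illustrative assumptions; connector names such as `card0-HDMI-A-1` vary by machine and driver.

```python
from pathlib import Path


def connector_states(drm_root="/sys/class/drm"):
    """Map each DRM connector directory name to the contents of its status file.

    Connector directories are named like "card0-HDMI-A-1"; the status file
    normally holds "connected" or "disconnected".
    """
    states = {}
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        states[status_file.parent.name] = status_file.read_text().strip()
    return states


# Example usage on the affected machine:
#   for name, state in connector_states().items():
#       print(f"{name}: {state}")
```

Saving this output right after a boot with the monitor attached, and again after hot-plugging it, would give a concrete before/after comparison to attach to the report.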

alexispurslane commented 7 months ago

Booting into my GNOME Wayland session with my external display connected over Mini DisplayPort instead of HDMI leaves my internal screen black, but my Nvidia card is otherwise accessible and working properly. The black internal screen is unexpected: I booted into Hybrid mode, so Wayland's inability to switch the mux on the internal display shouldn't matter, because the iGPU sends everything to the internal display anyway.