negativo17 / nvidia-kmod-common

NVIDIA's proprietary driver kernel module common files

545.29.02 intermittently fails to auto-login with "initcall_blacklist=simpledrm_platform_driver_init" removed #12

Closed zeroepoch closed 10 months ago

zeroepoch commented 11 months ago

Previously I had initcall_blacklist=simpledrm_platform_driver_init added in /etc/default/grub, and a recent update of nvidia-kmod-common removes this option when the driver is updated. I couldn't figure out why logging in on boot was so unreliable, sometimes even failing to load the session when logging in manually. I couldn't get VTs to work either. Once I added back initcall_blacklist=simpledrm_platform_driver_init everything started to work as expected: VTs work and it logs in automatically and reliably. I'm curious why this kernel option was removed and whether others are seeing the same issue.

scaronni commented 11 months ago

That's weird. In the latest release the console framebuffer option was added:

$ tail -1 /etc/modprobe.d/nvidia-modeset.conf
options nvidia-drm modeset=1 fbdev=1

This binds a new console framebuffer using the Nvidia kernel module, so regardless of that boot parameter the console is taken over and the framebuffer driver replaced. Besides this, the workaround has not been needed for quite some time.

Can you check your kernel command line for other spurious stuff (cat /proc/cmdline)? Also, did you customize the /etc/modprobe.d/nvidia-modeset.conf file?
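The two checks requested above can be scripted. This is a minimal sketch in POSIX sh; the function names `has_kernel_param` and `nvidia_drm_options` are made up for illustration and are not part of nvidia-kmod-common. It assumes the kernel command line is a single line of space-separated parameters.

```shell
# Hypothetical helpers, for illustration only.

# has_kernel_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole space-separated word in CMDLINE.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# nvidia_drm_options FILE
# Prints the non-comment "options nvidia-drm ..." lines from a modprobe
# config file. modprobe treats "-" and "_" in module names as equivalent,
# so both spellings are matched.
nvidia_drm_options() {
    grep -E '^[[:space:]]*options[[:space:]]+nvidia[-_]drm[[:space:]]' "$1"
}
```

For example, `has_kernel_param "$(cat /proc/cmdline)" initcall_blacklist=simpledrm_platform_driver_init && echo present` reports whether the workaround is active on the running kernel, and `nvidia_drm_options /etc/modprobe.d/nvidia-modeset.conf` shows whether the packaged `modeset=1 fbdev=1` line is still in place.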

scaronni commented 11 months ago

Let me rephrase: regardless of whether efifb, vesafb or simpledrm is driving your console, the driver is replaced, so that boot option, which tells the kernel to use efifb instead of simpledrm, should be completely useless.

zeroepoch commented 11 months ago

My system is a little weird and I'm not sure if it's my BIOS (x570 TUF Gaming), monitor (Monoprice 4k IPS/HDR), or GPU (3090 Ti), but I only see the BIOS output and Grub menu when I do a cold boot. A reboot results in a blank screen until the desktop is shown. If I enable CSM then I don't have this odd problem, but then ReSize BAR doesn't work. This is actually the same on Windows, where I see nothing until the tail end of the login process, so it has nothing to do with Linux that I can see. For this reason I collected logs both with a cold boot, where I think the EFI framebuffer is more properly initialized, and with a reboot, where the screen is blank at boot (monitor turns off for a short bit).

The main difference I can see in the case where initcall_blacklist=simpledrm_platform_driver_init is missing (cold boot or reboot) is that the following line shows up a few times.

[drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to apply atomic modeset.  Error code: -22

In either case, with or without simpledrm initialized, I see the following in the reboot case (blank screen).

[drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Flip event timeout on head 0

Neither error shows up in the cold boot case with the kernel option added.
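To compare boots, the two messages quoted above can be counted in a saved log. This is a rough sketch in POSIX sh and awk; `count_drm_errors` is a hypothetical helper name, and the match strings are taken verbatim from the two log lines in this comment.

```shell
# Hypothetical helper, for illustration only.

# count_drm_errors
# Reads log text on stdin and prints two numbers:
#   <atomic modeset failures> <flip event timeouts>
count_drm_errors() {
    awk '
        /Failed to apply atomic modeset/ { modeset++ }
        /Flip event timeout on head/     { flip++ }
        END { printf "%d %d\n", modeset, flip }
    '
}
```

Kernel messages for the current boot can be piped in with `journalctl -k -b | count_drm_errors`. Running it once per boot variant (cold boot or reboot, with or without the kernel option) gives a quick side-by-side comparison.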

As mentioned earlier, with initcall_blacklist=simpledrm_platform_driver_init it always ends up at the desktop, but the reboot case does take a little longer. With this kernel option omitted it sometimes stops at the login screen, sometimes logs in but all input is dead, and sometimes I can't log in manually (X dies). In these "broken" cases I collected the logs by SSH'ing from my laptop to the desktop.

I attached the journalctl output for gdm-x-session for one of the bad boots. What's probably most relevant here is this:

/usr/libexec/gdm-x-session[1994]: (EE) NVIDIA(GPU-0): Failed to acquire modesetting permission.
/usr/libexec/gdm-x-session[1994]: (EE) NVIDIA(0): Failing initialization of X screen

Adding back initcall_blacklist=simpledrm_platform_driver_init when nvidia-kmod-common updates will be a little annoying, but at least it provides a stable workaround. If others are seeing improved compatibility with this kernel option removed then it makes sense to keep the logic you have now.
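Re-adding the option after each update could be made idempotent, so that running it repeatedly never duplicates the parameter. This is a minimal sketch in POSIX sh that only works on a command-line string; `ensure_kernel_param` is a hypothetical name, and writing the result back to /etc/default/grub and regenerating the GRUB configuration is deliberately left out.

```shell
# Hypothetical helper, for illustration only.

# ensure_kernel_param CMDLINE PARAM
# Prints CMDLINE unchanged if PARAM is already present as a whole word,
# otherwise prints CMDLINE with PARAM appended.
ensure_kernel_param() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${1:+$1 }$2" ;;
    esac
}
```

For example, `ensure_kernel_param "ro quiet" initcall_blacklist=simpledrm_platform_driver_init` prints the original string with the option appended, and a second call on that output leaves it unchanged. On Fedora-style systems, `grubby --update-kernel=ALL --args="initcall_blacklist=simpledrm_platform_driver_init"` is another way to add the argument, though whether a later nvidia-kmod-common update would remove it again is not established in this thread.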

log.txt.gz

zeroepoch commented 10 months ago

@scaronni not sure what changed. It's either kernel 6.6 or, my more likely hunch, they fixed some related bug in 545.29.06. Anyways, I'm no longer seeing these modesetting issues in journalctl on either a cold boot or a reboot (where the screen blanks until the desktop loads). It also boots slightly faster (I think?) without the initcall disabled. Happy to have it resolved so I can leave the grub options as intended, with initcall_blacklist=simpledrm_platform_driver_init removed. I'll go ahead and close this issue out.

scaronni commented 10 months ago

Thanks for the feedback. I'm glad to hear it was solved with the update; there's not much I could have done otherwise.