Add action to add NVIDIA DRM driver to initramfs

crawfxrd commented 2 years ago

Fixes rebooting a system in clamshell mode and having the decryption prompt show up on an external display attached to a dGPU port.

Apply to known affected models:

oryp6
oryp9
oryp10

Ref: https://github.com/pop-os/plymouth-theme/issues/22#issuecomment-1022330320

leviport commented 2 years ago

This doesn't seem to be working on oryp6. Both before and after adding this patch, rebooting in clamshell just leaves the external monitor blank and I have to open the lid to see the prompt.

XV-02 commented 2 years ago

This does not alter behaviour on either the Kudu6 or the Oryp9. Log_NvidiaInitramFS_Kudu6.txt Log_NvidiaInitramFS_Oryp9.txt

On Kudu6, the external monitor is blank and the lid must be opened to see the decrypt prompt.

On Oryp9, the decrypt prompt displays, but the system still fails to proceed to a user login after the reported successful decrypt.

crawfxrd commented 2 years ago

Might have to be nvidia-drm instead of just nvidia. Is it in the initramfs?

lsinitramfs /boot/initrd.img-$(uname -r) | grep nvidia
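The same check can be scripted when comparing several machines. This is a minimal sketch, assuming the listing text comes from `lsinitramfs` and that module files end in `.ko`, optionally followed by a compression suffix; the function name `nvidia_modules_in_listing` is hypothetical and not part of this PR:

```python
import re

def nvidia_modules_in_listing(listing: str) -> set[str]:
    """Return the NVIDIA kernel module names found in an lsinitramfs listing."""
    found = set()
    for line in listing.splitlines():
        # Matches paths such as ".../nvidia-drm.ko" or ".../nvidia_drm.ko.zst".
        match = re.search(r"/(nvidia[-_a-z]*)\.ko(\.\w+)?$", line.strip())
        if match:
            # The kernel treats '-' and '_' in module names as equivalent,
            # so normalize to the hyphenated form.
            found.add(match.group(1).replace("_", "-"))
    return found
```

The listing itself could be obtained with `subprocess.run(["lsinitramfs", path], capture_output=True, text=True).stdout`; checking for `"nvidia-drm"` in the returned set then answers the question above.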

XV-02 commented 2 years ago

I'm not seeing nvidia-drm in the initramfs

Also, I noticed this error was present when trying to boot with an external monitor and closed lid: kernel: ACPI Warning: \_SB.PCI0.PEG2.DEV0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20211217/nsarguments-61)

crawfxrd commented 2 years ago

Changed it to the DRM module. Should work now.
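For readers without the diff at hand, the general idea can be sketched as follows. This is not the code from this PR; it assumes the initramfs-tools convention of listing extra modules in `/etc/initramfs-tools/modules` and regenerating the image afterwards with `update-initramfs -u`. The helper name `ensure_module_listed` is hypothetical:

```python
def ensure_module_listed(modules_text: str, module: str = "nvidia-drm") -> str:
    """Return modules_text with `module` appended unless it is already listed."""
    wanted = module.replace("_", "-")
    for line in modules_text.splitlines():
        # Strip comments; the first token on a line is the module name,
        # any following tokens are module options.
        tokens = line.split("#", 1)[0].split()
        if tokens and tokens[0].replace("_", "-") == wanted:
            return modules_text  # already present, nothing to change
    if modules_text and not modules_text.endswith("\n"):
        modules_text += "\n"
    return modules_text + module + "\n"
```

Keeping the function pure (text in, text out) makes it idempotent and easy to test; writing the file and rebuilding the initramfs would be separate, privileged steps.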

leviport commented 2 years ago

This is working great on oryp6 now. I'll let @XV-02 finish up any testing he's still working on, but I'm happy with it.

XV-02 commented 2 years ago

So, this is changing behaviour on both the Kudu6 and Oryp9

Kudu6

On Kudu6, the decrypt screen now shows where previously it did not. After entering the decryption password, the external monitor goes to idle, reporting a loss of signal. Opening the laptop's lid reveals the following error messages:

[ 104.830643] Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 124.850282] Freezing of tasks failed after 20.009 seconds (2 tasks refusing to freeze, wq_busy=0):
[ 125.514629] [drm:nv_drm_atomic_commit [nvidia_drm] *ERROR* [nvidia_srm] [GPU ID 0x00000100] Failed to apply atomic modeset. Error code: -22
[ 125.514697] [drm:nv_drm_master_set [nvidia_drm] *ERROR* [nvidia_srm] [GPU ID 0x00000100] Failed to grab modeset ownership
[ 152.551335] Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 172.570958] Freezing of tasks failed after 20.009 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 173.235418] [drm:nv_drm_atomic_commit [nvidia_drm] *ERROR* [nvidia_srm] [GPU ID 0x00000100] Failed to apply atomic modeset. Error code: -22
[ 179.759305] [drm:nv_drm_atomic_commit [nvidia_drm] *ERROR* [nvidia_srm] [GPU ID 0x00000100] Flip event timeout on head 0

And, snooping the journal log, I found:

Jul 21 12:12:44 pop-os kernel: Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
Jul 21 12:12:44 pop-os kernel: task:plymouthd       state:D stack:    0 pid:  416 ppid:     1 flags:0x00004006
Jul 21 12:12:44 pop-os kernel: Call Trace:
Jul 21 12:12:44 pop-os kernel:  <TASK>
Jul 21 12:12:44 pop-os kernel:  __schedule+0x246/0x5c0
Jul 21 12:12:44 pop-os kernel:  schedule+0x55/0xc0
Jul 21 12:12:44 pop-os kernel:  rwsem_down_read_slowpath+0x30f/0x360
Jul 21 12:12:44 pop-os kernel:  down_read+0x43/0x90
Jul 21 12:12:44 pop-os kernel:  nvkms_ioctl_from_kapi+0x2d/0x90 [nvidia_modeset]
Jul 21 12:12:44 pop-os kernel:  _nv000019kms+0x691/0x7f0 [nvidia_modeset]
Jul 21 12:12:44 pop-os kernel:  ? nv_drm_atomic_apply_modeset_config.isra.0+0x2e7/0x4c0 [nvidia_drm]
Jul 21 12:12:44 pop-os kernel:  ? nv_drm_atomic_apply_modeset_config.isra.0+0x441/0x4c0 [nvidia_drm]
Jul 21 12:12:44 pop-os kernel:  ? nv_drm_atomic_commit+0xba/0x340 [nvidia_drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_atomic_check_only+0x1a4/0x400 [drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_atomic_commit+0x58/0x60 [drm]
Jul 21 12:12:44 pop-os kernel:  ? nv_drm_atomic_helper_disable_all+0x1b3/0x2a0 [nvidia_drm]
Jul 21 12:12:44 pop-os kernel:  ? nv_drm_master_drop+0x28/0x60 [nvidia_drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_dropmaster_ioctl+0xd9/0x150 [drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_setmaster_ioctl+0x1d0/0x1d0 [drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_ioctl_kernel+0xb8/0x150 [drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_ioctl+0x265/0x4b0 [drm]
Jul 21 12:12:44 pop-os kernel:  ? drm_setmaster_ioctl+0x1d0/0x1d0 [drm]
Jul 21 12:12:44 pop-os kernel:  ? rseq_get_rseq_cs.isra.0+0x1b/0x240
Jul 21 12:12:44 pop-os kernel:  ? rseq_ip_fixup+0x72/0x1a0
Jul 21 12:12:44 pop-os kernel:  ? __x64_sys_ioctl+0x92/0xd0
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x5c/0x80
Jul 21 12:12:44 pop-os kernel:  ? exit_to_user_mode_prepare+0x92/0xb0
Jul 21 12:12:44 pop-os kernel:  ? syscall_exit_to_user_mode+0x26/0x50
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? syscall_exit_to_user_mode+0x26/0x50
Jul 21 12:12:44 pop-os kernel:  ? __x64_sys_read+0x19/0x20
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? do_syscall_64+0x69/0x80
Jul 21 12:12:44 pop-os kernel:  ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 21 12:12:44 pop-os kernel:  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Jul 21 12:12:44 pop-os kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Jul 21 12:12:44 pop-os kernel:  </TASK>

Which corresponds to the first of the errors that flashed on the screen before it resolved to gdm3. That starts at line 1729 of this log: Kudu6_f073b43.txt

I would categorize this as an improvement over previous behaviour, which never showed the plymouth prompt on the external monitor.
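To compare boots across models, the freeze failures can be pulled out of a saved journal. A minimal sketch, assuming the log lines have the same shape as the excerpts above; `summarize_freeze_failures` is a hypothetical helper name:

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted above.
FREEZE_RE = re.compile(
    r"Freezing of tasks failed after ([\d.]+) seconds "
    r"\((\d+) tasks refusing to freeze"
)
TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)")

def summarize_freeze_failures(log_text: str):
    """Return ([(seconds, task_count), ...], Counter of tasks in state D)."""
    failures = [(float(sec), int(n)) for sec, n in FREEZE_RE.findall(log_text)]
    # State D is uninterruptible sleep, which is what blocks the freezer.
    stuck = Counter(
        name for name, state in TASK_RE.findall(log_text) if state == "D"
    )
    return failures, stuck
```

Run against the Kudu6 journal, this would show how often the freeze fails and whether `plymouthd` is consistently the task holding it up.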

Oryp9

On the Oryp9, I'm managing to get logged in, but only when I escape out of the graphical interface for Plymouth. Even then, the Oryp9 seems to suspend between Plymouth decrypting and the login screen, and between the login screen and the actual user session. Using the graphical interface, the Oryp9 ~~neither goes into suspend, nor does it reach the login screen after entering the correct decrypt password.~~ does the same. [edit]

I pulled logs for the session and for plymouth for one of the non-graphical decrypt boots: Oryp9_f073b43.txt plymouth-debug-log.txt

I'm in the midst of pulling a set for a graphical attempt, but had to reinstall the Oryp9 as the drive being used was required for other testing.

leviport commented 2 years ago

Don't non-Nvidia systems also suspend after decrypting when in clamshell mode? That might not be related to this PR.

XV-02 commented 2 years ago

I'm fairly certain this moves us past the initial problem, yes. At this point, though, I don't want to approve without a more qualified set of eyes looking and confirming that this shifts the 'problem' out of the initramfs area.

Logs for graphical plymouth on Oryp9. Note, I was mistaken; behaviour is the same with graphical or non-graphical, just slower. plymouth-debug-log-graphic.txt Oryp9_Graphical.txt

XV-02 commented 2 years ago

I'd call this functional for Oryp9. Kudu6 is still having issues, and I don't have Oryp6 to test against.

For Oryp9's sake, I'd argue for excluding the Kudu6, and potentially the Oryp6 if it's not resolved, from this PR so that we can approve it and move forward with those models on their own.

leviport commented 2 years ago

Sounds good, please approve

crawfxrd commented 2 years ago

[ 125.514629] [drm:nv_drm_atomic_commit [nvidia_drm] *ERROR* [nvidia_srm] [GPU ID 0x00000100] Failed to apply atomic modeset. Error code: -22

I wonder if this is because of modeset=1? Maybe requires adding nvidia-modeset as well. But this is long after boot?
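Whether modesetting is actually enabled on a running system can be read back from sysfs. A minimal sketch, assuming the `modeset` parameter of `nvidia_drm` is exposed at the usual `/sys/module/<name>/parameters/` location and reads back as `Y` or `N`; both helper names are hypothetical:

```python
from pathlib import Path

def parse_bool_param(text: str) -> bool:
    """Interpret a sysfs boolean module parameter ('Y' or 'N')."""
    return text.strip().upper() in ("Y", "1")

def nvidia_drm_modeset_enabled(
    path: str = "/sys/module/nvidia_drm/parameters/modeset",
):
    """Return True or False, or None if the parameter cannot be read."""
    try:
        return parse_bool_param(Path(path).read_text())
    except OSError:
        # nvidia_drm is not loaded, or the path differs on this system.
        return None
```

A `None` result from inside the initramfs environment would suggest the module was not loaded at that stage at all.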

XV-02 commented 2 years ago

It's after the password has been accepted by plymouth. The password is taken, and the external monitor blanks, but all other behaviour indicates that the decryption has been successful. I would consider that long after boot.

leviport commented 2 years ago

Keypresses wake it up though, right? One would likely be using a keyboard if the machine is in clamshell mode, since the internal keyboard cannot be accessed in clamshell mode.

XV-02 commented 2 years ago

It doesn't hit the suspend state, so a keypress doesn't wake it.

leviport commented 2 years ago

Gotcha. Different from oryp6 then.

XV-02 commented 2 years ago

To clarify: no matter at what point the screen blanks, I can open the lid and get to a user session. The blank always occurs after the decrypt screen; on one occasion it happened after both decrypt and login. Without full disk encryption, the Kudu6 reaches the gdm login screen, and after login the external monitor blanks and it spits out the same errors before eventually reaching the user's session. If I try to switch to a TTY session once the external monitor has blanked, then no matter what I try, the Kudu6 hits a black screen, both internal and external, and never reaches a user session.

leviport commented 2 years ago

Fixes plymouth screen on external displays on mira-b3 and oryp9. I will test my oryp6 when I get a chance.

XV-02 commented 2 years ago

The decrypt prompt is now displaying on an external display for Oryp10 when the lid is shut on boot. However, GDM is never reached. After opening the lid, and after a prolonged wait, the following sets of errors show up:

Freezing of tasks failed after 20.005 seconds (2 tasks refusing to freeze, wq_busy=0); This displayed with a small range of time values spread over a couple of seconds.
watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [nvidia-sleep.sh:2340]

Waiting for a number of minutes (>10) the screen was still blank, however changing the TTY session (to TTY2) did yield a login prompt. After entering a user name, a password was never requested, and instead, the following errors/messages were noted:

watchdog: BUG: soft lockup - CPU#4 stuck for 649s! [nvidia-sleep.sh:2340] This message repeats most often, about every 25 seconds.

INFO: task khugepaged:146 blocked for more than 724 seconds.
              Tainted:   P                      W     OEL        5.19.0-76051900-generic #202207312230~1660780566~22.04~9d60db1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This repeats every 121 seconds.

XV-02 commented 2 years ago

On a different boot (this time on AC) with Oryp10 I had slightly different messages. Only (1 tasks refusing to freeze, wq_busy=0) this boot. New messages:

1. [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
2. nvidia-modeset: ERROR: GPU:0: Failed to bind display engine notify context DMA: 0x1a (Ran out of a critical resource, other than memory [NV_ERR_INSUFFICIENT_RESOURCES])
3. nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer

leviport commented 2 years ago

Looks good on my oryp6 as well. It also fits on my normal-sized ESP:

$ df -h /boot/efi/
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  511M  452M   60M  89% /boot/efi

XV-02 commented 2 years ago

Oryp10JournallogNvidiaInitramfs.txt

This is the log from a hung boot.

I'm seeing a wide range of different messages getting spit out from the log while the system hangs. For example, this picture of the screen after several minutes of hanging seems to capture lines from 1644 of the log through at least 2284, and potentially up to 2339. This is making me suspect that the previous messages might be slightly random as well, though the fact they were related to nvidia makes me think they're still relevant.

IMG_20220902_102504

XV-02 commented 2 years ago

Setting aside Oryp10 for the moment, as it's yet to be released, this PR does have a notable issue: as far as I can tell, there is no safeguard preventing the attempt to add the nvidia-driver to the initramfs when there is insufficient space in the EFI partition.

In and of itself, this shouldn't be an issue: the vast majority of Pop!_OS installs have a sparsely populated EFI partition, and two kernel revisions together are comfortably less than 300 MB. However, there are two factors we cannot currently control: the size of the nvidia-driver, and other OS installs sharing the EFI partition.

When updating through Pop-Shop, and I would expect also via command line, the system will attempt to write the initramfs with the nvidia-driver, completely fill the EFI partition, and render the system unbootable for the average user. Given the size of our user base, this fringe situation could break a not-small number of systems. It might also introduce a future hurdle for approving kernels or nvidia releases.

While some kind of test (potentially a check for a 1 GiB EFI partition) could be introduced, it would expand this PR beyond its original scope. It also might just be kicking the can down the road.
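For illustration only, such a safeguard might look like the sketch below. It assumes the EFI partition is mounted at `/boot/efi` (as in the `df` output earlier in this thread) and that the size of the new initramfs is known before it is copied; the 32 MiB reserve is an arbitrary placeholder, and both function names are hypothetical:

```python
import shutil

def has_room(free_bytes: int, required_bytes: int,
             reserve_bytes: int = 32 * 1024**2) -> bool:
    """True if writing required_bytes still leaves reserve_bytes free."""
    return free_bytes - required_bytes >= reserve_bytes

def esp_has_room(required_bytes: int, mount: str = "/boot/efi") -> bool:
    """Check the mounted EFI partition before copying a larger initramfs."""
    return has_room(shutil.disk_usage(mount).free, required_bytes)
```

A check like this only decides whether to proceed; what to do on failure (skip the module, warn the user, or prune old kernels) is the harder policy question raised above.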

13r0ck commented 2 years ago

superseded by https://github.com/pop-os/nvidia-graphics-drivers/pull/160

pop-os / system76-driver

Add action to add NVIDIA DRM driver to initramfs #246
