siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

nvidia kernel module not found on reboot upon fresh install #9041

Closed: achristianson closed this issue 2 months ago

achristianson commented 3 months ago

I'm on v1.7.5, running the proprietary nvidia kernel modules and the container toolkit, both as official extensions.

GPU worker nodes install and run fine, but there is a glitch on the first reboot after a fresh install (i.e. the node goes into the "Installing" state, then reboots). On that first boot after the "Installing" phase completes, I get errors about the nvidia kernel modules not being found. A hard reboot of the system fixes this.

Notably, this happens both on qemu/libvirt VMs and on bare metal. I think it may have something to do with ACPI soft reboot vs. hard reboot.

Edit:

A couple of other things of note:

One could speculate that this is some kind of hardware issue. However, before running Talos we were running k3s on Debian on the same bare metal with the official nvidia drivers (latest ~555 from the nvidia repos). In that setup I was always able to perform a software-initiated reboot, e.g. running "reboot" as root, and the nvidia kernel modules always loaded fine on the next boot. This points to something different about how Talos boots.

I can't pull logs since the node is stuck in "Booting" and I only have access via IPMI. Here's a screenshot from the IPMI console that shows one of the missing-module errors:

[image: IPMI console screenshot]

EDIT 2:

During the install process I caught this message on the console: "kexec core: launching new kernel"

I believe what may be happening is that when the kernel is restarted via kexec, the GPUs are not hardware-reset, so the driver fails to load. It could also be an issue with kexec and locating/loading the nvidia modules.

smira commented 3 months ago

kexec might be involved, but the module should still load; it might report a different error, though, if the NVIDIA GPU is not found.

My guess is that your case is related to the boot media (e.g. an ISO) having the GPU extension while the installer image does not. You boot initially with the GPU extension, then install without it; kexec reboots from disk correctly, but the GPU driver is missing from the installed system.

If you hard reboot, you boot from the ISO once again rather than from the disk.

achristianson commented 3 months ago

This is a fresh PXE boot install using matchbox, following the Talos docs. The provisioning process in this case does a full wipe of the system; then, on the first boot after the wipe, the Talos installer does its thing, the system reboots via kexec, and that's when we see the modules missing.

I believe the same thing happened on our libvirt deployments as well, which use the x86-64 Talos ISO with a cloud-init ISO attached.

The workaround in our provisioning process is to detect when Talos is ready and then perform a reboot. Everything works fine from there, but it's a bit of a hack.

I should note we're using an image from the Talos image factory that has no customizations other than adding the proprietary nvidia drivers and container toolkit extensions.
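For illustration, an Image Factory schematic of that shape might look roughly like the sketch below. The two extension names are assumptions based on the official extension catalog for the v1.7 series and may differ between releases, so check them against the catalog before use.

```yaml
# Image Factory schematic (sketch): no customizations other than
# the two NVIDIA system extensions.
customization:
  systemExtensions:
    officialExtensions:
      # Assumed names for the v1.7 series; verify against the extension catalog.
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```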

smira commented 3 months ago

I don't have much to add here besides what I said above.

You can try disabling kexec and see if that fixes the issue for you.

machine:
  sysctls:
    kernel.kexec_load_disabled: "1"

smira commented 2 months ago

The installation flow from ISO/PXE requires the NVIDIA extensions to be present in the installer image; it doesn't matter whether they are present in the PXE boot media.

See docs.
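
As a minimal sketch of what that means in the machine config, assuming the installer image comes from the Image Factory and was built from a schematic that includes the NVIDIA extensions: `<schematic-id>` is a placeholder for your own schematic ID, and the version tag simply mirrors the v1.7.5 mentioned earlier in this thread.

```yaml
machine:
  install:
    # The installer image is what gets written to disk, so it must carry
    # the NVIDIA extensions. <schematic-id> is a placeholder, not a real ID.
    image: factory.talos.dev/installer/<schematic-id>:v1.7.5
```

With the same schematic behind both the PXE boot assets and the installer image, the system that kexec boots into from disk should contain the same kernel modules as the boot media.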