Closed djeebus closed 1 year ago
One other note: the NVIDIA-Linux-x86_64-{{ .nvidia_driver_version }}-grid.run package, if you wanted to pull it down, requires a vGPU license, which does have a 90-day trial. Alternatively, if you wanted to see it w/o the hoops, I'm happy to send the driver file to you directly. It's ~337 MB.
The Talos kernel is configured to load only kernel modules which are signed with a key. That key is never persisted: it's ephemeral and lives only as part of the build process.
In other words, the way Talos pkgs are built, the kernel is built first and the signing key is generated, but it exists only in the build cache. The next step picks up the build cache and uses the same key, e.g. to sign the NVIDIA modules.
Note: buildkit has configuration for GC policy; make sure it's configured so that the cache is not dropped between the steps.
After that, the kernel and some modules are packaged as part of the Talos installer image (or any other boot assets), and some as extensions. As the matching kernel & modules are signed with the same key, they can be loaded successfully.
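To make the consequence of an ephemeral key concrete, here is a toy model. It is only a sketch: it uses HMAC from Python's standard library as a stand-in for the kernel's real asymmetric module signatures, and every name in it (`new_build_key`, `sign_module`, `kernel_accepts`) is hypothetical, not part of Talos or the kernel.

```python
import hashlib
import hmac
import secrets


def new_build_key() -> bytes:
    """Hypothetical stand-in for the ephemeral key generated during one kernel build."""
    return secrets.token_bytes(32)


def sign_module(key: bytes, module: bytes) -> bytes:
    """Toy signature: an HMAC over the module bytes (NOT the kernel's real scheme)."""
    return hmac.new(key, module, hashlib.sha256).digest()


def kernel_accepts(trusted_key: bytes, module: bytes, signature: bytes) -> bool:
    """The toy kernel only accepts modules signed with the key from its own build."""
    return hmac.compare_digest(sign_module(trusted_key, module), signature)


module = b"nvidia.ko contents"

published_build_key = new_build_key()  # key from the build that produced the running kernel
local_rebuild_key = new_build_key()    # key from a separate, later rebuild of pkgs

# Module signed in the same build as the running kernel: accepted.
accepted_same_build = kernel_accepts(
    published_build_key, module, sign_module(published_build_key, module)
)

# Module signed in a different build: rejected, because the keys differ.
accepted_other_build = kernel_accepts(
    published_build_key, module, sign_module(local_rebuild_key, module)
)

print(accepted_same_build, accepted_other_build)
```

The only point of the sketch is that two independent builds produce two unrelated keys, so a module from one build cannot be verified by a kernel from the other.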
So in your case:
Ah, ok, so in my case the kernel is being built b/c it's a dependency of the kernel module pkg, but b/c it's not literally the same kernel that's actually running (since that comes from the talos image, which is pulling the kernel from siderolabs and not my own ghcr repo), that's where the mismatch comes from. That makes sense ...
... which begs the question - can I create a kernel module extension w/o having to rebuild/deploy parallel versions of the kernel, talos, installer, and any sub package that those require? or is that just a rabbit hole that i'm going to have to dive down in order to get this to work?
Teslas like the T4 are supported by the nonfree drivers. I'm not sure whether the grid drivers are something different; if the standard non-free drivers work, you can use the ones published by siderolabs.
... which begs the question - can I create a kernel module extension w/o having to rebuild/deploy parallel versions of the kernel, talos, installer, and any sub package that those require? or is that just a rabbit hole that i'm going to have to dive down in order to get this to work?
Short: No. Long: see my response above - nobody can sign the module with the key which was used to sign any published Talos kernel, as the key is ephemeral.
Ah, ok. I think my scenario is a little more complicated than that. I've got a baremetal proxmox server w/ a Tesla P40 in it, and I'm running Talos in a few VMs w/ the Tesla P40 split up into multiple VGPUs. My understanding is that the P40 VGPUs require different drivers than the ones that are publicly available, and installing the nonfree extension seems to bear that out:
kern: info: [2023-09-14T03:08:37.389695757Z]: nvidia-nvlink: Nvlink Core is being initialized, major device number 239
kern: err: [2023-09-14T03:08:37.390536757Z]:
kern: info: [2023-09-14T03:08:37.409886757Z]: nvidia 0000:00:10.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
SUBSYSTEM=pci
DEVICE=+pci:0000:00:10.0
kern: warning: [2023-09-14T03:08:37.411164757Z]: NVRM: The NVIDIA GPU 0000:00:10.0 (PCI ID: 10de:1b38)\x0aNVRM: installed in this system is not supported by the\x0aNVRM: NVIDIA 535.54.03 driver release.\x0aNVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'\x0aNVRM: in this release's README, available on the operating system\x0aNVRM: specific graphics driver download page at www.nvidia.com.
kern: warning: [2023-09-14T03:08:37.433482757Z]: nvidia: probe of 0000:00:10.0 failed with error -1
kern: warning: [2023-09-14T03:08:37.434129757Z]: NVRM: The NVIDIA probe routine failed for 1 device(s).
kern: warning: [2023-09-14T03:08:37.434777757Z]: NVRM: None of the NVIDIA devices were initialized.
kern: info: [2023-09-14T03:08:37.435914757Z]: nvidia-nvlink: Unregistered Nvlink Core, major device number 239
user: warning: [2023-09-14T03:08:37.488789757Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"nvidia\x5c": load nvidia failed: no such device"}
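For anyone triaging a similar log, the useful facts are buried in the escaped NVRM warning: the PCI ID of the rejected GPU and the driver release that rejected it. A minimal sketch of pulling them out follows; the helper names and the regular expression are my own assumptions, written against the one log line above rather than any documented NVIDIA message format.

```python
import re


def decode_kmsg_escapes(text: str) -> str:
    """Turn literal \\xNN escapes (as printed in the log above) back into characters."""
    return re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), text)


# Assumed pattern, derived only from the warning quoted in this thread.
UNSUPPORTED_GPU = re.compile(
    r"NVRM: The NVIDIA GPU (?P<addr>[0-9a-f:.]+) "
    r"\(PCI ID: (?P<pci_id>[0-9a-f]{4}:[0-9a-f]{4})\)"
    r".*?NVIDIA (?P<driver>[\d.]+) driver release",
    re.DOTALL,
)


def parse_unsupported_gpu(line: str):
    """Return a dict with addr, pci_id and driver, or None if the line does not match."""
    match = UNSUPPORTED_GPU.search(decode_kmsg_escapes(line))
    return match.groupdict() if match else None


sample = (
    r"kern: warning: [2023-09-14T03:08:37.411164757Z]: NVRM: The NVIDIA GPU "
    r"0000:00:10.0 (PCI ID: 10de:1b38)\x0aNVRM: installed in this system is not "
    r"supported by the\x0aNVRM: NVIDIA 535.54.03 driver release."
)

info = parse_unsupported_gpu(sample)
print(info)
```

The extracted PCI ID is what you would compare against the supported-products list for a given driver release.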
I'll work on rebuilding everything, and report back when I get some results. Any chance you have the script used to build and publish releases available somewhere?
It's all in the Makefiles, .drone.yml, everything is completely in the source tree
the non-free drivers work fine on a100 and t4, and it's in the list mentioned so I would assume it works the same for P40 :thinking: :shrug:
I think this (click "Supported Products") is the list of devices that are supported by the nonfree (but publicly available) drivers, whereas this is the list of hardware supported by their not-publicly-available drivers.
Regardless, after taking your advice and rebuilding/pushing the world in one go from my desktop (pkgs, extensions, and talos itself), the error went away, and the driver's usable inside kubernetes, verified via an nvidia-smi pod in-cluster. Thanks!
I see, it's sad it's gated behind a license. Glad the issues are sorted.
I'm working on creating an extension for the nvidia grid drivers (nvidia's open drivers don't support datacenter vgpus like the tesla line of cards). I've copied the nonfree-kmod-nvidia tree, in the hopes that they're similar enough that I can swap the linux installer and it would Just Work :tm:, but either I did something wrong or I'm missing some critical step in building and pushing the extensions.
NOTE: One other thing that might complicate all this is that I'm running fairly old hardware that requires GOAMD64=v1. This is the process I use to have github actions build all the artifacts for me. It currently builds everything, but I'm fairly certain I only use the installer and talos images. I'm also currently on 1.5.1.
build process
nonfree-kmod-nvidia-pkg package and create the nonfree-kmod-nvidia-grid-pkg package
nonfree-kmod-nvidia extension and create the nonfree-kmod-nvidia-grid extension
installation
create the following patch:
apply the patch via:
trigger a reboot:
the error message
After all that, I get the following pair of messages in dmesg after a reboot:
Any advice you could give would be very welcome, thanks!