siderolabs / extensions

Talos Linux System Extensions
Mozilla Public License 2.0

new kernel module fails to insert with "key was rejected by service" #227

Closed: djeebus closed 1 year ago

djeebus commented 1 year ago

I'm working on creating an extension for the nvidia grid drivers (nvidia's open drivers don't support datacenter vgpus like the tesla line of cards). I've copied the nonfree-kmod-nvidia tree, in the hope that they're similar enough that I can swap the linux installer and it would Just Work :tm:, but either I did something wrong or I'm missing some critical step in building and pushing the extensions.

NOTE: One other thing that might complicate all this is that I'm running fairly old hardware that requires GOAMD64=v1. This is the process I use to have github actions build all the artifacts for me. It currently builds everything, but I'm fairly certain I only use the installer and talos images. I'm also currently on 1.5.1.

build process

installation

create the following patch:

# nvidia-vgpu.yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/djeebus/talos/nonfree-kmod-nvidia-grid:535.54.03-v1.5.1
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1

apply the patch via:

talosctl \
    --nodes $NODE \
    patch mc \
    --patch @nvidia-vgpu.yaml

trigger a reboot:

talosctl \
    --nodes $NODE \
    upgrade --image=ghcr.io/djeebus/talos/installer:v1.5.1

the error message

After all that, I get the following pair of messages in dmesg after a reboot:

$NODE: kern:  notice: [2023-09-12T16:35:58.881790988Z]: Loading of module with unavailable key is rejected
$NODE: user: warning: [2023-09-12T16:35:58.887564988Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"nvidia\x5c": load nvidia failed: key was rejected by service"}

Any advice you could give would be very welcome, thanks!

djeebus commented 1 year ago

One other note: the NVIDIA-Linux-x86_64-{{ .nvidia_driver_version }}-grid.run package, if you wanted to pull it down, requires a vgpu license, which does have a 90 day trial. Alternately, if you wanted to see it w/o the hoops, I'm happy to send the driver file to you directly. It's ~337 MB.

smira commented 1 year ago

Talos kernel is configured to only load kernel modules which are signed with a key. That key is never persisted, it's ephemeral, and lives only as part of the build process.

In other words, the way Talos pkgs are built, the kernel is built first, and the signing key is generated, but it's only in the build cache. The next step picks up the build cache, and uses same key e.g. to sign NVIDIA modules.

Note: buildkit has configuration for GC policy, make sure it's configured so that the cache is not dropped between the steps.

After that, the kernel and some modules are packaged as part of the Talos installer image (or any other boot assets), and some as extensions. As the matching kernel & modules are signed with the same key, they can be loaded successfully.
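The same-key requirement described above can be illustrated with a small analogy. This is only a sketch: it uses plain RSA signatures via the openssl CLI rather than the kernel's actual module signing format, and every file name in it is hypothetical. It shows that a signature made with one build's ephemeral key verifies against that build's public key, while a signature made with a different build's key is rejected.

```shell
#!/bin/sh
# Analogy only: real kernel modules carry a different signature format and are
# checked against keys built into the kernel. All file names are made up.
work=$(mktemp -d)
printf 'pretend kernel module\n' > "$work/module.ko"

# "Build 1": generate an ephemeral key; its public half stands in for the key
# the kernel trusts.
openssl genrsa -out "$work/build1.key" 2048 2>/dev/null
openssl rsa -in "$work/build1.key" -pubout -out "$work/kernel-trusted.pub" 2>/dev/null

# "Build 2": a separate build of the same sources generates its own key.
openssl genrsa -out "$work/build2.key" 2048 2>/dev/null

# sign KEY SIGFILE: sign the pretend module with the given private key.
sign() { openssl dgst -sha256 -sign "$1" -out "$2" "$work/module.ko"; }

# verify SIGFILE: check a signature against the "kernel-trusted" public key.
verify() {
    openssl dgst -sha256 -verify "$work/kernel-trusted.pub" \
        -signature "$1" "$work/module.ko" >/dev/null 2>&1
}

sign "$work/build1.key" "$work/same-build.sig"
sign "$work/build2.key" "$work/other-build.sig"

same_build=$(verify "$work/same-build.sig" && echo accepted || echo rejected)
other_build=$(verify "$work/other-build.sig" && echo accepted || echo rejected)

echo "module signed in the same build as the kernel: $same_build"
echo "module signed in a separate build:             $other_build"

rm -rf "$work"
```

The second case corresponds to a module built and signed in a separate build from the kernel that is actually running.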

So in your case:

djeebus commented 1 year ago

ah, ok, so in my case, the kernel is being built b/c it's a dependency of the kernel module pkg, but b/c it's not literally the same kernel that's actually running (since that comes from the talos image, which is pulling the kernel from siderolabs and not my own ghcr repo), that's where the mismatch comes from. that makes sense ...

... which begs the question - can I create a kernel module extension w/o having to rebuild/deploy parallel versions of the kernel, talos, installer, and any sub package that those require? or is that just a rabbit hole that i'm going to have to dive down in order to get this to work?

frezbo commented 1 year ago

Teslas like the T4 are supported by the nonfree drivers. Not sure if the grid drivers are something different, but if the standard non-free drivers work, you can use the ones published by siderolabs.

smira commented 1 year ago

... which begs the question - can I create a kernel module extension w/o having to rebuild/deploy parallel versions of the kernel, talos, installer, and any sub package that those require? or is that just a rabbit hole that i'm going to have to dive down in order to get this to work?

Short: No. Long: see my response above - nobody can sign the module with the key which was used to sign any published Talos kernel, as the key is ephemeral.

djeebus commented 1 year ago

Ah, ok. I think my scenario is a little more complicated than that. I've got a baremetal proxmox server w/ a Tesla P40 in it, and I'm running Talos in a few VMs w/ the Tesla P40 split up into multiple VGPUs. My understanding is that the P40 VGPUs require different drivers than the ones that are publicly available, and installing the nonfree extension seems to bear that out:

kern:    info: [2023-09-14T03:08:37.389695757Z]: nvidia-nvlink: Nvlink Core is being initialized, major device number 239
kern:     err: [2023-09-14T03:08:37.390536757Z]: 
kern:    info: [2023-09-14T03:08:37.409886757Z]: nvidia 0000:00:10.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
 SUBSYSTEM=pci
 DEVICE=+pci:0000:00:10.0
kern: warning: [2023-09-14T03:08:37.411164757Z]: NVRM: The NVIDIA GPU 0000:00:10.0 (PCI ID: 10de:1b38)\x0aNVRM: installed in this system is not supported by the\x0aNVRM: NVIDIA 535.54.03 driver release.\x0aNVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'\x0aNVRM: in this release's README, available on the operating system\x0aNVRM: specific graphics driver download page at www.nvidia.com.
kern: warning: [2023-09-14T03:08:37.433482757Z]: nvidia: probe of 0000:00:10.0 failed with error -1
kern: warning: [2023-09-14T03:08:37.434129757Z]: NVRM: The NVIDIA probe routine failed for 1 device(s).
kern: warning: [2023-09-14T03:08:37.434777757Z]: NVRM: None of the NVIDIA devices were initialized.
kern:    info: [2023-09-14T03:08:37.435914757Z]: nvidia-nvlink: Unregistered Nvlink Core, major device number 239
user: warning: [2023-09-14T03:08:37.488789757Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"nvidia\x5c": load nvidia failed: no such device"}
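The two load failures in this thread point at different causes: "key was rejected by service" means the module is signed with a key the running kernel does not have, while "no such device" means the driver loaded but did not find a GPU it supports. A minimal sketch of a helper that tells them apart from log text follows. The function name and the output labels are made up for illustration; only the two error strings come from the logs in this thread.

```shell
#!/bin/sh
# classify_module_error: read log lines on stdin and print a one-word guess
# at why a kernel module failed to load. Hypothetical helper, not part of Talos.
classify_module_error() {
    log=$(cat)
    case $log in
        # Module signature does not match a key trusted by the running kernel.
        *"key was rejected by service"*) echo "signature-mismatch" ;;
        # Driver loaded and probed, but no supported device was found.
        *"no such device"*)              echo "unsupported-hardware" ;;
        *)                               echo "unknown" ;;
    esac
}

# Example use against a node (not executed here):
#   talosctl --nodes "$NODE" dmesg | classify_module_error
```

If both strings appear in the same input, the signature case wins, since it is checked first.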

I'll work on rebuilding everything, and report back when I get some results. Any chance you have the script used to build and publish releases available somewhere?

smira commented 1 year ago

It's all in the Makefiles, .drone.yml, everything is completely in the source tree

frezbo commented 1 year ago

the non-free drivers work fine on a100 and t4, and it's in the list mentioned so I would assume it works the same for P40 :thinking: :shrug:

djeebus commented 1 year ago

I think this (click "Supported Products") is the list of devices that are supported by the nonfree (but publicly available) drivers, whereas this is the list of hardware supported by their not-publicly-available drivers.

Regardless, after taking your advice and rebuilding/pushing the world in one go from my desktop (pkgs, extensions, and talos itself), the error went away, and the driver's usable inside kubernetes, verified via an nvidia-smi pod in-cluster. Thanks!

frezbo commented 1 year ago

I think this (click "Supported Products") is the list of devices that are supported by the nonfree (but publicly available) drivers, whereas this is the list of hardware supported by their not-publicly-available drivers.

Regardless, after taking your advice and rebuilding/pushing the world in one go from my desktop (pkgs, extensions, and talos itself), the error went away, and the driver's usable inside kubernetes, verified via an nvidia-smi pod in-cluster. Thanks!

I see, it's sad it's gated behind a license. Glad the issues are sorted.