siderolabs / extensions

Talos Linux System Extensions
Mozilla Public License 2.0

How can I patch and build the open-gpu-kernel-modules extension to support the arm64 LX2160a platform? #529

Open asymingt opened 2 days ago

asymingt commented 2 days ago

I've managed to set up a Talos cluster with both amd64 and arm64 worker nodes. I have no issues running amd64 GPU jobs using the nonfree / production NVIDIA driver extension. There have been some sharp edges, but all in all I've had a pretty clean experience along the way, even though my Kubernetes knowledge is limited. Thank you!

The arm64 node is a Honeycomb LX2k board based around the LX2160A SOM, which requires a patch to open-gpu-kernel-modules to function. This patch appears not to have made it into their master branch, so I don't think it is present in either the LTS or the production variant of the Talos published extensions. A related issue is given here showing the OSS modules working with this patch. Before switching to Talos I was running containerized GPU images on this platform in Ubuntu 22.04 on an Ampere card without issues.

I checked out this repo thinking I might be able to apply a patch to the driver build script, but on closer inspection it appears that this repo actually stitches together prebuilt and signed artifacts from the container registry ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules-*. Could you nudge me in the right direction to patch, build and sign my own OSS modules to produce an updated Talos extension? Or is there an official process whereby Sidero Labs can supply a prebuilt image with the patch applied, so that the drivers support this platform?

smira commented 17 hours ago

The best way is to submit a PR to pkgs repository with the patch.

asymingt commented 32 minutes ago

> The best way is to submit a PR to pkgs repository with the patch.

Oh, thank you! I wasn't aware of the pkgs repo. I'll try my hand at patching and building there :+1: