I've managed to set up a Talos cluster with both amd64 and arm64 worker nodes. I have no issues running amd64 GPU jobs using the nonfree / production NVIDIA driver extension. There have been some sharp edges, but all in all I've had a pretty clean experience along the way, even though my Kubernetes knowledge is limited. Thank you!
The arm64 node is a Honeycomb LX2K board based around the LX2160A SOM, and it requires a patch to open-gpu-kernel-modules to function. This patch does not appear to have made it into their master branch, so I don't think it is present in either the LTS or the production variant of the Talos-published extensions. A related issue is linked here showing the OSS modules working with this patch. Before switching to Talos, I was running containerized GPU images on this platform under Ubuntu 22.04 with an Ampere card without issues.
I checked out this repo thinking I might be able to apply a patch to the driver build script, but on closer inspection it appears that this repo actually stitches together prebuilt, signed artifacts from the container registry ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules-*. Could you nudge me in the right direction to patch, build, and sign my own OSS modules to produce an updated Talos extension? Or is there an official process whereby Sidero Labs could supply a prebuilt image with the patch applied, so that this platform is supported by the drivers?
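For reference, the generic workflow I had in mind is sketched below. This is only my assumption of the shape of the process, not something taken from the Talos build; the patch filename is hypothetical, and the open question for me is the last step (pointing the build at the matching Talos kernel tree for arm64, then packaging and signing the result as a system extension):

```shell
# Sketch only -- patch filename is hypothetical, and the kernel-source
# wiring for a Talos arm64 target is exactly what I'm unsure about.
git clone https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules

# Check out the driver version matching the extension, e.g. a release tag.
# git checkout <driver-version-tag>

# Apply the LX2160A fix from the linked issue (hypothetical filename).
git apply ../lx2160a.patch

# Build the kernel modules; for Talos this would presumably need to be
# cross-compiled against the matching Talos kernel headers rather than
# the host kernel, and the resulting modules signed and repackaged as a
# Talos system extension (e.g. via the siderolabs/extensions build).
make modules -j"$(nproc)"
```

If the siderolabs/extensions build can be pointed at a patched source tree (or accept a patch file) instead of the prebuilt ghcr.io artifacts, that would be exactly the nudge I'm looking for.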