Support GPU Nodes in CAPI Clusters

In the Context of CAPZ:

In order to use GPUs in kubernetes we need

It is very old; it was last touched in 2020.
- Unsupported Kubernetes API versions
- Old driver versions
The driver-install container does not work on current Flatcar; see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481272356 for details.
- Workaround: I installed the drivers using the Azure-provided script, see below.
The nvidia-gpu-device-plugin container does work and correctly registers the device plugin on GPU nodes, but:
- It is not clear where the container comes from. I only found https://github.com/giantswarm/retagger/pull/439/files, which is very old (2019), with the last image in that repo from 2020.
- It does not use the NVIDIA Container Toolkit and does not set up the plugin configuration in containerd/config.toml, so applications might (or will) need to mount the lib directory containing the NVIDIA libraries; see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481272356
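
To make the missing containerd configuration concrete, here is a minimal sketch that renders the runtime entry the container toolkit would normally add to containerd/config.toml. It assumes the containerd config version 2 layout and the runc v2 shim. The binary path and the runtime name `nvidia` are illustrative defaults, not values taken from our images, and on Flatcar the binary would likely live outside the read-only /usr.

```python
# Sketch: render the containerd CRI runtime entry for the NVIDIA container runtime.
# Assumptions: containerd config version 2 layout and the runc v2 shim.
# The default binary path is a placeholder; adjust it to the actual node image.


def render_nvidia_runtime(
    binary_path: str = "/usr/bin/nvidia-container-runtime",
    runtime_name: str = "nvidia",
) -> str:
    """Return a TOML fragment registering an extra containerd runtime."""
    base = f'plugins."io.containerd.grpc.v1.cri".containerd.runtimes.{runtime_name}'
    lines = [
        f"[{base}]",
        '  runtime_type = "io.containerd.runc.v2"',
        f"[{base}.options]",
        f'  BinaryName = "{binary_path}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the fragment so it can be appended to or merged into config.toml.
    print(render_nvidia_runtime(), end="")
```

With an entry like this in place, pods would still need to select the runtime, for example through a RuntimeClass whose handler matches the runtime name, unless it is made the default runtime.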

The Azure Flatcar image ships a script (/usr/share/oem/bin/setup-nvidia) and a service (nvidia.service) to install NVIDIA drivers on Flatcar, but it has some issues; see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1470411882

I tried installing the drivers with the provided script and then getting the container toolkit to configure containerd/config.toml, but I was not successful; see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1480019148

This can be fixed with a bit of work.

NVIDIA provides an operator that installs the driver and toolkit and configures containerd; see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481011318

It worked as expected on an Ubuntu CAPI cluster.
On Flatcar the support is somewhat spotty: the driver image is missing for the current Flatcar version, but it should be possible to update it.

We essentially need to decide on and implement one (or a mix) of the following solutions:

Pre-build the drivers into the image: https://github.com/giantswarm/capi-image-builder/blob/main/helm/capi-image-builder/templates/pipelines/capz.yaml
- Immutable images, the way we like them
- Means we won't use the NVIDIA Operator
- The container toolkit could also be installed at build time
- At runtime we only need a service that loads the modules, configures containerd, and registers the device plugin when the node is a GPU node
Build the drivers at node boot time
- Could use the NVIDIA Operator if we can fix the missing Flatcar driver image
- Could use our app by updating and fixing it
  - We could also add support for the container toolkit and configure containerd, making it easier for workloads to use the GPU without mounting the lib directory
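
For the pre-built option, the runtime service has to decide whether the node is a GPU node before loading modules. A minimal sketch of one way to do that, assuming the GPU is exposed as a PCI device in sysfs: look for a display-class device with NVIDIA's PCI vendor ID. The function name and the idea of running this from the boot-time service are illustrative, not something the issue prescribes.

```python
# Sketch: detect an NVIDIA GPU by scanning PCI devices in sysfs.
# Assumption: the GPU shows up under /sys/bus/pci/devices on the node.
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def has_nvidia_gpu(sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """Return True if any NVIDIA display-class PCI device is present."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return False
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            # Entry without readable vendor/class files; skip it.
            continue
        # PCI base class 0x03 is "display controller"
        # (for example 0x0300 VGA, 0x0302 3D controller).
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            return True
    return False


if __name__ == "__main__":
    # Exit code 0 on GPU nodes, 1 otherwise, so a unit could gate on it.
    raise SystemExit(0 if has_nvidia_gpu() else 1)
```

A check like this would let the same image run on GPU and non-GPU node pools, with the GPU-only steps skipped when no device is found.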

Should we have a GPU flag on the node pool that, when set, does the following?
- Adds labels
- Adds taints
- Adds Ignition configuration to trigger the GPU-only service (replacing the nvidia.service that Azure ships)
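
As a rough illustration of the flag idea, the sketch below turns a hypothetical `gpu` field on a node pool into kubelet arguments. The `NodePool` type, the label key `example.com/gpu`, and the taint value are all assumptions made for the example; only the kubelet argument names `node-labels` and `register-with-taints` are real.

```python
# Sketch: derive kubelet extra args from a hypothetical node pool "gpu" flag.
# NodePool, GPU_LABELS and GPU_TAINTS are illustrative, not an existing API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodePool:
    name: str
    gpu: bool = False  # hypothetical flag discussed in this issue
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[str] = field(default_factory=list)


GPU_LABELS = {"example.com/gpu": "true"}  # placeholder label key
GPU_TAINTS = ["nvidia.com/gpu=present:NoSchedule"]  # assumed taint convention


def kubelet_extra_args(pool: NodePool) -> Dict[str, str]:
    """Merge the pool's own labels and taints with the GPU defaults."""
    labels = dict(pool.labels)
    taints = list(pool.taints)
    if pool.gpu:
        labels.update(GPU_LABELS)
        taints += [t for t in GPU_TAINTS if t not in taints]
    args: Dict[str, str] = {}
    if labels:
        args["node-labels"] = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    if taints:
        args["register-with-taints"] = ",".join(taints)
    return args


if __name__ == "__main__":
    print(kubelet_extra_args(NodePool(name="gpu-pool", gpu=True)))
```

The Ignition part is left out here; the same flag could gate whether the GPU-only unit is included in the node's Ignition configuration.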
