primeroz opened this issue 1 year ago (status: Open)
In the context of CAPZ:

This is a follow-up to the investigation in https://github.com/giantswarm/roadmap/issues/2104

In order to use GPUs in Kubernetes we need the NVIDIA drivers installed on the node, containerd configured to use them, and the NVIDIA device plugin registered on the GPU nodes.
- we do have a `kubernetes-gpu-app`, documented at https://docs.giantswarm.io/advanced/gpu/
  - the `driver-install` container does not work on current Flatcar, see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481272356 for details
  - the `nvidia-gpu-device-plugin` container does work and correctly registers the device plugin on GPU nodes, but it does not include the NVIDIA container toolkit and does not set up the plugin configuration in containerd's `config.toml`, so applications might/will need to mount the `lib` directory with the NVIDIA libraries - see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481272356
- Azure Flatcar ships a script (`/usr/share/oem/bin/setup-nvidia`
) and a service (`nvidia.service`) to install the NVIDIA drivers on Flatcar, but it has some issues - see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1470411882
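Until the toolkit and containerd configuration are in place, the workaround of mounting the host's NVIDIA library directory into the workload would look roughly like the sketch below. The image, mount path, and host path are illustrative assumptions; the real host path depends on where the driver install places the libraries on Flatcar.

```yaml
# Sketch of a GPU workload that mounts the host NVIDIA libraries directly.
# ASSUMPTION: the hostPath below is illustrative, not the actual Flatcar path.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # served by the registered device plugin
      volumeMounts:
        - name: nvidia-libs
          mountPath: /usr/local/nvidia
  volumes:
    - name: nvidia-libs
      hostPath:
        path: /opt/nvidia/current  # illustrative; depends on the driver install
```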
  - I did try to install the drivers with the provided script and then to get the container toolkit to configure containerd's `config.toml`, but I was not successful - https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1480019148
  - this can be fixed with a bit of work
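For reference, a successful `nvidia-ctk runtime configure --runtime=containerd` run would normally leave a runtime entry along these lines in `/etc/containerd/config.toml`. This is a sketch for containerd's v2 CRI config; exact section names and the runtime binary path vary by containerd and toolkit version.

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```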
- NVIDIA provides an operator that installs the driver and the container toolkit and configures containerd - https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481011318
  - the `driver` image is missing for the current Flatcar version, but it should be possible to update it

We essentially need to decide on and implement one (or a mix) of the following solutions:
- **NVIDIA Operator**: if we can fix the missing Flatcar image issue, let the operator install the driver and the container toolkit and configure containerd
- install the **container toolkit** ourselves (it could also be installed at build time), load the kernel modules, configure containerd and register the device plugin when the node is a GPU node
- install the **container toolkit** and configure containerd to make it easier for workloads to use the GPU without mounting the `lib` directory
- add a **flag** for `gpu` on the node pool and, when set, render **ignition** configurations for triggering the GPU-only service (to replace the `nvidia.service` Azure ships)
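The node-pool-flag option could look roughly like the following Ignition configuration, shown here in Butane form and rendered only for GPU node pools. The unit name is a hypothetical replacement for Azure's `nvidia.service`; reusing `/usr/share/oem/bin/setup-nvidia` as the installer is an assumption, given the issues with it noted above.

```yaml
# Butane sketch (Flatcar variant): enable a GPU setup unit only on GPU node pools.
variant: flatcar
version: 1.0.0
systemd:
  units:
    - name: gpu-setup.service   # hypothetical replacement for Azure's nvidia.service
      enabled: true
      contents: |
        [Unit]
        Description=Install NVIDIA drivers and configure containerd on GPU nodes
        Before=kubelet.service

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        # ASSUMPTION: reuse the OEM-provided installer as a starting point
        ExecStart=/usr/share/oem/bin/setup-nvidia

        [Install]
        WantedBy=multi-user.target
```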
Motivation

As GPUs are optimised for high-performance, multi-step processes and parallel processing, they have great benefits for customers' AI and machine-learning teams. Some customers have already asked for GPU support in CAPI clusters, so we should add the option to choose CPU or GPU nodes when creating a workload cluster.
Outcome