giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Support GPU Nodes in CAPI Clusters #2212

Open primeroz opened 1 year ago

primeroz commented 1 year ago

Motivation

GPUs are optimised for high-performance, multi-step, parallel processing, which makes them valuable for customers' AI and machine learning teams. Some customers have already asked for GPU support in CAPI clusters, so we should add the option to choose CPU or GPU nodes when creating a workload cluster.

### Stories
- [ ] https://github.com/giantswarm/roadmap/issues/2104
- [ ] https://github.com/giantswarm/roadmap/issues/2346
- [ ] https://github.com/giantswarm/roadmap/issues/2347
- [ ] https://github.com/giantswarm/roadmap/issues/2348

Outcome

Context Hint

Rotfuks commented 1 year ago

In the Context of CAPZ:

Technical Hints

This is a follow-up to the investigation in https://github.com/giantswarm/roadmap/issues/2104

In order to use GPUs in kubernetes we need

What do we already have in Giant Swarm

We have a kubernetes-gpu-app, documented at https://docs.giantswarm.io/advanced/gpu/

What do we get out of the box on Flatcar on Azure

Flatcar on Azure ships a script (/usr/share/oem/bin/setup-nvidia) and a service (nvidia.service) to install the NVIDIA drivers, but this has some issues - see https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1470411882

NVIDIA Container Toolkit on Flatcar

I installed the drivers with the provided script and then tried to get the container toolkit to configure containerd/config.toml, but I was not successful - https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1480019148

This can be fixed with a bit of work.
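For orientation, here is a minimal sketch of the containerd stanza the container toolkit would need to end up with. It assumes containerd config version 2 and the CRI plugin table name used there. The binary path is a hypothetical placeholder (Flatcar's /usr is read-only, so the real location would differ), and the helper name is made up for illustration. The exact keys are worth verifying against the toolkit version in use.

```python
def render_nvidia_runtime_config(binary_path: str, runtime_name: str = "nvidia") -> str:
    """Render a containerd (config version 2) CRI runtime stanza as TOML text.

    Assumption: the runtime is registered under the CRI plugin table and
    points at the nvidia-container-runtime binary via the BinaryName option.
    """
    table = f'plugins."io.containerd.grpc.v1.cri".containerd.runtimes.{runtime_name}'
    lines = [
        f"[{table}]",
        '  runtime_type = "io.containerd.runc.v2"',
        f"  [{table}.options]",
        f'    BinaryName = "{binary_path}"',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical install location, chosen only for this example.
    print(render_nvidia_runtime_config("/opt/nvidia/bin/nvidia-container-runtime"))
```

The rendered fragment would be merged into containerd's config.toml, followed by a containerd restart. How that merge is done on Flatcar is left open here.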

NVIDIA Operator

NVIDIA provides an operator that installs the driver and toolkit and configures containerd - https://github.com/giantswarm/roadmap/issues/2104#issuecomment-1481011318

Proposal

We essentially need to decide on and implement one (or a mix) of the following solutions:

  1. Pre-build the drivers into the image: https://github.com/giantswarm/capi-image-builder/blob/main/helm/capi-image-builder/templates/pipelines/capz.yaml
    • Immutable images, the way we like it
    • Means we won't use the NVIDIA Operator
    • The container toolkit could also be installed at build time
    • At runtime we only need a service to load the modules, configure containerd and register the device plugin when the node is a GPU node
  2. Build the drivers at node boot time
    • Could use the NVIDIA Operator if we can fix the issue of the missing Flatcar image
    • Could use our app by updating and fixing it
      • We could also add support for the container toolkit and configure containerd, making it easier for workloads to use the GPU without mounting the lib directory
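For option 1, the runtime service has to decide whether the node is a GPU node before it loads modules or touches containerd. One possible check, sketched below, scans sysfs for a PCI display-class device with NVIDIA's vendor ID (0x10de). The function name and the idea of gating the service on this check are assumptions for illustration, not something decided in this issue.

```python
from pathlib import Path

NVIDIA_PCI_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA
DISPLAY_CLASS_PREFIX = "0x03"    # PCI base class 0x03: display controllers (VGA, 3D)


def has_nvidia_gpu(pci_devices_dir: Path = Path("/sys/bus/pci/devices")) -> bool:
    """Return True if any PCI device is an NVIDIA display-class device."""
    if not pci_devices_dir.is_dir():
        return False
    for device in pci_devices_dir.iterdir():
        try:
            vendor = (device / "vendor").read_text().strip().lower()
            pci_class = (device / "class").read_text().strip().lower()
        except OSError:
            # Entry without readable vendor/class files: skip it.
            continue
        if vendor == NVIDIA_PCI_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            return True
    return False


if __name__ == "__main__":
    print("GPU node" if has_nvidia_gpu() else "not a GPU node")
```

A boot-time unit could run a check like this first and skip module loading, containerd configuration and device plugin registration on CPU-only nodes, so the same image works for both node types.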

Cluster-Azure