
[RFC] Unified framework with CDI for diverse vendor GPUs in Kata #9561

Open Apokleos opened 7 months ago

Apokleos commented 7 months ago

Issues with Kata using Diverse Vendor GPUs

The diverse usage patterns across different GPU vendors introduce significant compatibility challenges for users employing Kata with GPUs from various manufacturers. This necessitates extensive adaptation efforts, hindering both user experience and Kata's widespread adoption. Building on the success of Kata's Container Device Interface (CDI) integration with NVIDIA GPUs, we propose a unified and simplified approach to address this challenge.

Goals:

A unified framework promotes compatibility across diverse vendor GPUs, reducing the need for custom adaptations.

Solution:

  1. Introduce a designated path, "kata-linux-gpu-drivers", located at /opt/kata/share/kata-linux-gpu-drivers, and create soft links from the various vendor GPU driver locations to this target path.
  2. Introduce kata-ctl cdi subcommands to manage GPU drivers. These subcommands will read files from the designated path, construct file lists based on template files, and generate CDI configuration files.
  3. Leverage containerd/CDI to modify the OCI spec. This step integrates the CDI configuration into the OCI specification; containerd/CDI is responsible for the conversion work at this stage.
  4. Utilize sandbox bind mounts to share files with the guest for tasks, such as kernel module loading and service configuration, that must be performed within the guest environment. This makes the *.ko files and service configurations accessible inside the guest.
  5. Employ the VM's prestart hooks to load kernel modules or start other required services. These hooks trigger the loading of the necessary kernel modules and services before containers are created.
  6. Leverage Kata's share fs volume capability alongside CDI configuration files. This approach shares host files (binaries, libraries) with Kata containers via CDI configuration files (a minimal CDI spec skeleton is sketched right after this list).
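
For orientation, a CDI configuration file has roughly the following shape. This is only a hand-written sketch against the CDI specification; the kind name, device names, and paths below are placeholders rather than output of the proposed tooling:

# Hand-written CDI spec skeleton (placeholders only, not generated output)
cdiVersion: "0.6.0"
kind: vendor.example.com/gpu          # assumed vendor/class name
devices:
- name: "0"                           # device name referenced by the runtime
  containerEdits:
    deviceNodes:
    - path: /dev/vfio/75              # example VFIO group node
containerEdits:                       # edits applied whenever a device from this spec is injected
  env:
  - EXAMPLE_VISIBLE_DEVICES=all       # placeholder environment variable
  mounts:
  - containerPath: /usr/lib64/libexample-gpu.so
    hostPath: /opt/kata/share/kata-linux-gpu-drivers/libexample-gpu.so
    type: bind
    options:
    - ro
    - nosuid
    - nodev
    - bind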

Key issues to address:

1. How to generate GPU driver configurations? Not all GPU driver files need to be shared with the container environment. To address this, we provide a customizable template file, with limited extensibility, which acts as a filter. By comparing entries in this template file against the driver installation, we can filter out unnecessary parts, ensuring that the generated configuration only includes essential content. We'll use the NVIDIA GPU driver as an example. A possible template file for NVIDIA GPU drivers looks like the following:

# Copyright (c) 2024 Ant Group
# 
# SPDX-License-Identifier: Apache-2.0
# 
# template.yaml for GPU Driver

libs:
  video:
  - libvdpau_nvidia.so # NVIDIA VDPAU ICD
  - libnvidia-encode.so # Video encoder
  # - libnvidia-opticalflow.so # NVIDIA Opticalflow library
  - libnvcuvid.so # Video decoder
  utility:
  - libnvidia-ml.so # Management library
  - libnvidia-cfg.so # GPU configuration
  - libnvidia-nscq.so # Topology info for NVSwitches and GPUs
  compute:
  - libcuda.so # CUDA driver library
  # - libcudadebugger.so # CUDA Debugger Library
  - libnvidia-opencl.so # NVIDIA OpenCL ICD
  # - libnvidia-gpucomp.so # Shared Compiler Library
  # - libnvidia-ptxjitcompiler.so # PTX-SASS JIT compiler (used by libcuda)
  # - libnvidia-fatbinaryloader.so # fatbin loader (used by libcuda)
  - libnvidia-allocator.so # NVIDIA allocator runtime library
  - libnvidia-compiler.so # NVVM-PTX compiler for OpenCL (used by libnvidia-opencl)
  # - libnvidia-pkcs11.so # Encrypt/Decrypt library
  # - libnvidia-pkcs11-openssl3.so # Encrypt/Decrypt library (OpenSSL 3 support) 
  - libnvidia-nvvm.so # The NVVM Compiler library
  graphics:
  # - libnvidia-egl-wayland.so # EGL wayland platform extension (used by libEGL_nvidia)
  - libnvidia-eglcore.so # EGL core (used by libGLES*[_nvidia] and libEGL_nvidia)
  - libnvidia-glcore.so # OpenGL core (used by libGL or libGLX_nvidia)
  - libnvidia-tls.so # Thread local storage (used by libGL or libGLX_nvidia)
  - libnvidia-glsi.so # OpenGL system interaction (used by libEGL_nvidia)
  - libnvidia-fbc.so # Framebuffer capture
  - libnvidia-ifr.so # OpenGL framebuffer capture
  - libnvidia-rtcore.so # Optix
  - libnvoptix.so # Optix
  graphics_glvnd:
  # - libGLX.so",                      # GLX ICD loader 
  # - libOpenGL.so",                   # OpenGL ICD loader 
  # - libGLdispatch.so",               # OpenGL dispatch (used by libOpenGL, libEGL and libGLES*) 
  - libGLX_nvidia.so # OpenGL/GLX ICD
  - libEGL_nvidia.so # EGL ICD
  - libGLESv2_nvidia.so # OpenGL ES v2 ICD
  - libGLESv1_CM_nvidia.so # OpenGL ES v1 common profile ICD
  - libnvidia-glvkspirv.so # SPIR-V Lib for Vulkan
  - libnvidia-cbl.so # VK_NV_ray_tracing
  graphics_compat:
  - libGL.so # OpenGL/GLX legacy _or_ compatibility wrapper (GLVND)
  - libEGL.so # EGL legacy _or_ ICD loader (GLVND)
  - libGLESv1_CM.so # OpenGL ES v1 common profile legacy _or_ ICD loader (GLVND)
  - libGLESv2.so # OpenGL ES v2 legacy _or_ ICD loader (GLVND)
  ngx:
  - libnvidia-ngx.so # NGX library
  dxcore:
  - libdxcore.so # Core library for dxcore support
bins:
  utility:
  - nvidia-smi                      # System management interface
  - nvidia-debugdump                # GPU coredump utility
  - nvidia-persistenced             # Persistence mode utility
  - nv-fabricmanager                # NVSwitch fabric manager utility
  # - nvidia-modprobe                 # Kernel module loader
  # - nvidia-settings                 # X server settings
  # - nvidia-xconfig                  # X xorg.conf editor
  compute:
  - nvidia-cuda-mps-control         # Multi process service CLI
  - nvidia-cuda-mps-server          # Multi process service server

Then, based on this template, we navigate to the GPU driver directory and perform file matching. During the search, entries in the template file are matched against the discovered driver files, and only those that match are retained, ensuring that the newly generated driver-config.yaml contains only the essential components.
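
As a rough, hypothetical sketch (the schema of driver-config.yaml is not finalized in this RFC; the field names and versioned file names below are illustrative only), the generated file could look something like:

# Hypothetical driver-config.yaml: the result of matching the template above
# against the files discovered in the driver directory (names/versions are examples)
driver_version: "535.54.03"
driver_path: /opt/kata/share/kata-linux-gpu-drivers
libs:
  utility:
  - libnvidia-ml.so.535.54.03
  - libnvidia-cfg.so.535.54.03
  compute:
  - libcuda.so.535.54.03
  - libnvidia-opencl.so.535.54.03
bins:
  utility:
  - nvidia-smi
  - nvidia-persistenced
  compute:
  - nvidia-cuda-mps-control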

This design requires an intermediate file, driver-config.yaml, which serves as the input for the CDI conversion. Since different scenarios may involve distinct generation methods, the primary approaches currently employed are described in the points below.

2. How to generate CDI configurations? To address this, CDI Specification support has been introduced in the kata-ctl gpu subcommand to help convert driver specifications into CDI specifications. Building upon the solution to the first key issue, we convert the GPU driver config data structure into the corresponding CDI Spec data structure and use this conversion as the basis for generating the expected CDI configuration file. During the conversion, each item is handled according to its type in the driver config: for library files and executable files, for example, we construct a Mount, configure the containerPath and hostPath information appropriately, and set any required environment variables.

...
  env:
  - NVIDIA_VISIBLE_DEVICES=void
  - LD_PRELOAD=/usr/lib64/libnvidia-ml.so.535.54.03

  mounts:
  - containerPath: /usr/lib64/libnvidia-ml.so.535.54.03
    hostPath: /opt/kata/share/kata-linux-gpu-drivers/libnvidia-ml.so.535.54.03
    type: bind
    options:
    - ro
    - nosuid
    - nodev
    - bind

  - containerPath: /usr/bin/nvidia-smi
    hostPath: /opt/kata/share/kata-linux-gpu-drivers/nvidia-smi
    type: bind
    options:
    - ro
    - nosuid
    - nodev
    - bind
...

3. Are there tools available to assist with the conversion of the config and CDI? Yes, of course. To address this, new subcommands gpu gen-cfg and gpu gen-cdi will be introduced to the kata-ctl tool to facilitate the conversion process for GPU drivers. But these subcommands are not the whole story.

4. How to set the VFIO device (GPU) assigned to the Kata runtime in the CDI config? This task requires collaboration between the device-plugin (or other components) and kata-ctl. The device-plugin or other components should provide the IOMMU group ID and other relevant information about the allocated GPU devices to the kata-ctl gpu commands, which are then responsible for completing the devices section of the CDI configuration. An example of the devices configuration is as follows:

...
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/vfio/75
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/vfio/76
  name: "1"
- containerEdits:
    deviceNodes:
    - path: /dev/vfio/75
    - path: /dev/vfio/76
  name: all
...

Reference

CDI nvc_info.c

What are your opinions on this RFC? Comments, please!

Apokleos commented 7 months ago

cc @lifupan @zvonkok

zvonkok commented 7 months ago

We need to sync on this @lifupan @Apokleos @bergwolf. This will introduce complexities and other "problems" that have already been solved. There is a far easier way to accomplish this, and you're tying proprietary logic to Kata. Let me follow up next week with a concrete example of how the go-runtime solves it. Kata shouldn't be responsible for creating the inner runtime CDI config, the vendor should; Kata should not load any module, the vendor should, especially in the CoCo use case where we have some ordering of tools and modules.

We do not want to change the logic every time new HW is released.

At a minimum, prestart hooks shouldn't be used; this is the architecture we had many years ago and it introduced a lot of headaches. The workload decides what tools/libraries are available in the container, not Kata. What is meant by the host? Is host == guest VM? Kata should not be responsible for creating any links or modifications to the vendor files; this is accomplished by CDI.

We have a unified way of doing this, let me give you an overview next week and we can iterate over it.

Besides the regular containers we also have management containers that need different CDI specs, and may change. We shouldn't push any vendor specific logic into Kata, this is where CDI is helping.

lifupan commented 7 months ago

Hi @zvonkok

Sure, we really don’t want to introduce the vendor’s private logic to kata, but want to make a clear separation between kata and vendor; in addition, we want the vendor to do as little work as possible and to be able to integrate well with kata. If you have a better way, we can discuss it together.

Apokleos commented 7 months ago

> We need to sync on this @lifupan @Apokleos @bergwolf. This will introduce complexities and other "problems" that have already been solved. There is a far easier way to accomplish this, and you're tying proprietary logic to Kata. Let me follow up next week with a concrete example of how the go-runtime solves it. Kata shouldn't be responsible for creating the inner runtime CDI config, the vendor should; Kata should not load any module, the vendor should, especially in the CoCo use case where we have some ordering of tools and modules.
>
> We do not want to change the logic every time new HW is released.
>
> At a minimum, prestart hooks shouldn't be used; this is the architecture we had many years ago and it introduced a lot of headaches. The workload decides what tools/libraries are available in the container, not Kata. What is meant by the host? Is host == guest VM? Kata should not be responsible for creating any links or modifications to the vendor files; this is accomplished by CDI.

The host is not the guest VM; it is the node where the Kata pod and containerd run. In this solution, only one outer runtime CDI (containerd/CDI) is involved.

> We have a unified way of doing this, let me give you an overview next week and we can iterate over it.
>
> Besides the regular containers we also have management containers that need different CDI specs, and may change. We shouldn't push any vendor specific logic into Kata, this is where CDI is helping.

Thx @zvonkok, some questions about NVIDIA GPUs with CDI:

  • Can GPUs from other vendors use existing CDI-related tools to generate CDI configs?
  • If NVIDIA GPU drivers are not installed on the host, can we use CDI-related tools to generate CDI configs?
  • Can CDI metadata (or CDI spec annotations) be mapped to OCI Spec annotations in the containerd/CDI processing stage?
zvonkok commented 7 months ago

Having the drivers on the host will break several use cases, including Confidential Containers and vGPU. The drivers should not be installed on the host. In the vGPU use case, we have host drivers and guest drivers.

In the case of Confidential Containers, the PF and VF guest drivers are mandatory in the VM, not on the host.

The outer runtimes handle the VFIO devices, and depending on the vfio_mode={guest-kernel, vfio}, the kata-agent handles the "actual" device or VFIO group.

The outer runtime takes the VFIO device, checks what type it is, and annotates the OCI spec with a vendor CDI annotation, e.g.

cdi.k8s.io/vfio17: nvidia.com/gpu=0
cdi.k8s.io/vfio17: nvidia.com/gpu=1

or for Intel and AMD GPUs:

cdi.k8s.io/vfio22: intel.com/gpu=0
cdi.k8s.io/vfio22: amd.com/gpu=0

The index depends on the PCIe topology created in the VM, at least for NVIDIA; there will be similar logic for AMD/Intel GPUs governing how the driver enumerates the index for each GPU. This is important later for mapping the correct GPU to the correct container in a Pod.

A container will request nvidia.com/gpu: 1, which will map in the Device Plugin to a specific VFIO device; the VFIO device is a specific GPU with a specific index that needs to be propagated to the kata-agent.
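
For context, such a request uses the standard Kubernetes extended-resource syntax; a minimal sketch follows (the runtimeClassName and image below are placeholders, not from this thread):

# Standard Kubernetes resource request; the device plugin maps nvidia.com/gpu
# to a concrete VFIO device (and hence a specific GPU index) for this container
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: kata                  # assumed Kata RuntimeClass name
  containers:
  - name: cuda-workload
    image: example.com/cuda-app:latest    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1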

In the kata-agent we "only" use the CDI crate (WIP) to modify the OCI spec and inject the proper devices from the annotations provided by the outer runtime, see as an example: https://github.com/kata-containers/kata-containers/pull/9584

The outer runtime needs to do the following: https://github.com/kata-containers/kata-containers/pull/8861

Azure, for example, disables all outer runtime hooks, and we should avoid outer runtime hooks as much as possible, since in the confidential use case we do not trust the outer runtime; this is also where the agent policy comes into play.

The kata-agent can do "anything" because it is in the TCB (Trusted Compute Base) inside the VM. All trusted components need to run in the VM so having the drivers on the host is a no-go and BTW we had this model with Google many years ago and proved to be very problematic in so many ways. I would highly discourage going this way since this breaks a lot of use-cases.

> Can GPUs from other vendors use existing CDI-related tools to generate CDI configs?

In the NVIDIA case the device-plugin creates the VFIO CDI specs. This is simple enough, so I would assume that an Intel or AMD device-plugin does the same; this will be integrated into DRA, see e.g.: https://github.com/NVIDIA/k8s-dra-driver

Meta-information can also be supplied by the DP and DRA plugins, but this is vendor-dependent. We can add some logic to kata-ctl for generating VFIO CDI specs, but I assume that each vendor will have its own CDI spec generation tool.

> If NVIDIA GPU drivers are not installed on the host, can we use CDI-related tools to generate CDI configs?

Yes, the nvidia-rootfs will generate the CDI specs during driver load and initialization of the GPU; we need per-vendor special commands to make it work properly (will link a PR here soon).

> Can CDI metadata (or CDI spec annotations) be mapped to OCI Spec annotations in the containerd/CDI processing stage?

containerd currently has no support for it, but you can read the CDI spec in the outer runtime and set up the VM accordingly; the kata-agent does not need any additional meta information, since the outer runtime sets up the VM and PCIe topology.

https://github.com/kubernetes/enhancements/pull/4113 will introduce some changes to containerd that we're working on, and we will consider meta information in that context. Will tag you on it.