Apokleos opened 7 months ago
cc @lifupan @zvonkok
We need to sync on this @lifupan @Apokleos @bergwolf. This will introduce complexities and other "problems" that have already been solved. There is a far easier way to accomplish this, and you're tying proprietary logic to Kata. Let me follow up next week with a concrete example of how the go-runtime solves it. Kata shouldn't be responsible for creating the inner runtime CDI config, the vendor should; Kata should not load any module, the vendor should, especially in the CoCo use-case where we have some ordering of tools and modules.
We do not want to change the logic every time new HW is released.
Prestart hooks shouldn't be used, at a minimum; this is the architecture we had many, many years ago and it introduced a lot of headaches. The workload decides what tools/libraries are available in the container, not Kata. What is meant by the host? host == guest VM? Kata should not be responsible for creating any links or modifications to the vendor files; this is accomplished by CDI.
We have a unified way of doing this, let me give you an overview next week and we can iterate over it.
Besides the regular containers we also have management containers that need different CDI specs, and may change. We shouldn't push any vendor specific logic into Kata, this is where CDI is helping.
Hi @zvonkok
Sure, we really don't want to introduce the vendor's private logic into Kata, but want a clear separation between Kata and the vendor; in addition, we want the vendor to do as little work as possible and to be able to integrate well with Kata. If you have a better way, we can discuss it together.
> What is meant by the host? host == guest VM?
The host is not the guest VM; it is the node where the Kata pod and containerd run. In this solution, only one outer runtime CDI (containerd/CDI) is involved.
Thx @zvonkok, some questions about NVIDIA GPUs with CDI:
- Can GPUs from other vendors use existing CDI-related tools to generate CDI configs?
- If NVIDIA GPU drivers are not installed on the host, can we use CDI-related tools to generate CDI configs?
- Can CDI's metadata (or CDI spec annotations) be mapped to OCI Spec annotations in the containerd/CDI processing stage?
Having the drivers on the host will break several use cases, including Confidential Containers and vGPU. The drivers should not be installed on the host. In the vGPU use case, we have host drivers and guest drivers.
In the case of Confidential Containers, the PF and VF guest drivers are mandatory in the VM, not on the host.
The outer runtimes handle the VFIO devices, and depending on the `vfio_mode = {guest-kernel, vfio}` setting, the kata-agent handles the "actual" device or the VFIO group.
The outer runtime takes the VFIO device, checks what type it is, and annotates the OCI spec with a vendor CDI annotation, e.g.:

- `cdi.k8s.io/vfio17: nvidia.com/gpu=0`
- `cdi.k8s.io/vfio17: nvidia.com/gpu=1`

or for Intel, AMD GPUs:

- `cdi.k8s.io/vfio22: intel.com/gpu=0`
- `cdi.k8s.io/vfio22: amd.com/gpu=0`
The index depends on the PCIe topology created in the VM, at least for NVIDIA, but there will also be some logic for AMD/Intel GPUs around how the driver enumerates the index for each GPU. This is important later to map the correct GPU to the correct container in a Pod.
A container will request `nvidia.com/gpu: 1`, which the Device Plugin will map to a specific VFIO device; the VFIO device is a specific GPU with a specific index that needs to be propagated to the kata-agent.
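For illustration only (the pod name, image, and runtimeClassName below are placeholders, not taken from this thread), such a request is a standard Kubernetes resource limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                           # hypothetical name
spec:
  runtimeClassName: kata                  # placeholder Kata runtime class
  containers:
    - name: cuda-workload
      image: example.com/cuda-app:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1               # mapped by the device plugin to a VFIO device
```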
In the kata-agent we "only" use the CDI crate (WIP) to modify the OCI spec and inject the proper devices from the annotations provided by the outer runtime; see, as an example: https://github.com/kata-containers/kata-containers/pull/9584
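As a rough sketch (the device-node paths are assumptions, not the spec actually generated in the guest), a CDI spec that an annotation like `cdi.k8s.io/vfio17: nvidia.com/gpu=0` could resolve against might look like:

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"                   # matches the =0 index in the annotation
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0    # assumed guest device nodes
        - path: /dev/nvidiactl
        - path: /dev/nvidia-uvm
```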
The outer runtime needs to do the following: https://github.com/kata-containers/kata-containers/pull/8861
Azure, for example, disables all outer runtime hooks, and we should avoid outer runtime hooks as much as possible since in the confidential use-case we do not trust the outer runtime; this is also where the agent policy comes into play.
The kata-agent can do "anything" because it is in the TCB (Trusted Computing Base) inside the VM. All trusted components need to run in the VM, so having the drivers on the host is a no-go. BTW, we had this model with Google many years ago and it proved to be very problematic in so many ways. I would highly discourage going this way since it breaks a lot of use-cases.
> Can GPUs from other vendors use existing CDI-related tools to generate CDI configs?
In the NVIDIA case the device-plugin creates the VFIO CDI specs. This is simple enough, so I would assume that an Intel or AMD device-plugin does the same; this will be integrated into DRA, see e.g.: https://github.com/NVIDIA/k8s-dra-driver
Meta-information can also be supplied by the DP and the DRA plugin, but this is vendor-dependent.
We can add some logic to `kata-ctl` for generating VFIO CDI specs, but I assume that each vendor will have its own CDI spec generation tool.
> If NVIDIA GPU drivers are not installed on the host, can we use CDI-related tools to generate CDI configs?
Yes, the nvidia-rootfs will generate the CDI specs during driver load and initialization of the GPU; we need per-vendor special commands to make it work properly (will link a PR here soon).
> Can CDI's metadata (or CDI spec annotations) be mapped to OCI Spec annotations in the containerd/CDI processing stage?
containerd currently has no support for it, but you can read the CDI spec in the outer runtime and set up the VM accordingly. The kata-agent does not need any additional meta information since the outer runtime sets up the VM and PCIe topology.
https://github.com/kubernetes/enhancements/pull/4113 will introduce some changes to containerd that we're working on, and we will consider meta information in this context. Will tag you on it.
Issues with Kata using Diverse Vendor GPUs
The diverse usage patterns across different GPU vendors introduce significant compatibility challenges for users employing Kata with GPUs from various manufacturers. This necessitates extensive adaptation efforts, hindering both user experience and Kata's widespread adoption. Building on the success of Kata's Container Device Interface (CDI) integration with NVIDIA GPUs, we propose a unified and simplified approach to address this challenge.
Goals:
- A unified framework that promotes compatibility for diverse vendor GPUs, reducing the need for custom adaptations
Solution:
- Define a designated target path, `kata-linux-gpu-drivers`, to create soft links for various GPU drivers and point them to this path: create a soft link from each GPU driver path to this designated target path. This path is located in `/opt/kata/share/kata-linux-gpu-drivers`.
- Introduce `kata-ctl` cdi subcommands to manage GPU drivers. These subcommands will read files from the designated path, construct file lists based on template files, and generate CDI configuration files.
- Make the `*.ko` files or service configurations accessible within the guest environment.

Key issues to address:
1. How to generate GPU driver configurations? As we all know, not all GPU driver files need to be shared with the container environment. To address this, we provide a customizable template file with limited extensibility, which acts as a filter. By comparing entries in this template file, we can effectively filter out unnecessary parts, ensuring that the generated configuration only includes essential content. We'll use the NVIDIA GPU driver as an example to illustrate. A possible template file for NVIDIA GPU drivers looks like the one below:
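The original template is not reproduced in this thread; a purely hypothetical sketch of such a filter template (schema and entries are illustrative only) could look like:

```yaml
# Hypothetical filter template, not the RFC's actual format
libraries:
  - libcuda.so*
  - libnvidia-ml.so*
binaries:
  - nvidia-smi
  - nvidia-persistenced
modules:
  - nvidia.ko
  - nvidia_uvm.ko
```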
Then, based on this template, we need to navigate to the GPU driver directory and perform file-search matching. During driver-file searching, entries in the template file are matched against the discovered files, and only those that match are retained, ensuring that the newly generated driver-config.yaml incorporates solely the essential components.
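A hypothetical shape for this intermediate driver-config.yaml (field names are assumptions for illustration, not the RFC's actual schema) might be:

```yaml
# Hypothetical driver-config.yaml, sketch only
vendor: nvidia.com
driverPath: /opt/kata/share/kata-linux-gpu-drivers
entries:
  - type: library
    path: lib64/libnvidia-ml.so.1
  - type: executable
    path: bin/nvidia-smi
  - type: module
    path: nvidia.ko
```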
In this design, we do require an intermediate file named driver-config.yaml, which serves as the input for CDI conversion. Since different scenarios may involve distinct generation methods, the primary approaches currently employed are as follows:
2. How to generate CDI configurations? To address this issue, the CDI Specification has been introduced in the `kata-ctl gpu` subcommand to help convert driver specifications into CDI specifications. Building upon the solution to the first key issue, we will proceed to convert the GPU Driver config data structure to the corresponding CDI Spec data structure and utilize this conversion as the foundation to generate the expected CDI configuration file. During the conversion process, we perform corresponding conversions based on the type of each item in the driver config. For instance, for library files and executable files, we need to construct a `Mount`, properly configure the `containerPath` and `hostPath` information, and set any required environment variables.

3. Are there tools available to assist with the conversion of config and CDI? Yes, of course. To address this, new subcommands `gpu gen-cfg` and `gpu gen-cdi` will be introduced to the tool `kata-ctl` to facilitate the conversion process for GPU drivers. But that's not all.

4. How to set the VFIO device (GPU) assigned to the Kata runtime in the CDI config? This task requires collaboration between the device-plugin or other components and kata-ctl. The device-plugin or other components should provide the IOMMU Group ID and other relevant information about the allocated GPU devices to the `kata-ctl gpu` commands. These commands will then be responsible for completing the devices section in the CDI configuration. An example of the devices configuration is as follows:
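The RFC's original example is not reproduced in this thread; a hypothetical CDI fragment combining the mount edits from item 2 with the devices section from item 4 (paths, names, and IOMMU group numbers are illustrative) might look like:

```yaml
# Hypothetical CDI spec fragment, illustrative only
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
containerEdits:
  mounts:
    - hostPath: /opt/kata/share/kata-linux-gpu-drivers/lib64/libnvidia-ml.so.1
      containerPath: /usr/local/nvidia/lib64/libnvidia-ml.so.1
      options: ["ro", "nosuid", "nodev", "bind"]
  env:
    - LD_LIBRARY_PATH=/usr/local/nvidia/lib64   # assumed environment variable
devices:
  - name: gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/vfio/71                    # VFIO group node for the GPU's IOMMU group (illustrative)
        - path: /dev/vfio/vfio
```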
Reference
- CDI
- nvc_info.c
What are your opinions about this RFC? Comments, please!