lf-edge / eve

EVE is the Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0

pillar: Add support to CDI for native containers #4265

Closed · rene closed this 1 week ago

rene commented 2 weeks ago

The Container Device Interface (CDI) is already supported by containerd, and there are CDI files provided for the NVIDIA platform that describe GPU devices for native container access. This commit introduces support for CDI in pillar, so Video I/O adapters can pass a CDI device name to enable direct access from native containers.

The corresponding documentation is also provided.
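For reference, a CDI spec file looks roughly like the sketch below. This is an illustrative example only, not one of the files shipped with this PR: the device name "all" and the device-node and library paths are placeholders. containerd resolves a fully-qualified device name (e.g. nvidia.com/orin-gpu=all) against such files and applies the listed edits to the container's OCI spec.

```yaml
# Illustrative CDI spec (e.g. /etc/cdi/nvidia-orin.yaml); names and paths are placeholders
cdiVersion: "0.6.0"
kind: nvidia.com/orin-gpu
devices:
  - name: all                      # referenced as nvidia.com/orin-gpu=all
    containerEdits:
      deviceNodes:
        - path: /dev/nvhost-ctrl   # GPU device nodes exposed to the container
        - path: /dev/nvmap
      mounts:
        - hostPath: /usr/lib/aarch64-linux-gnu/libcuda.so
          containerPath: /usr/lib/aarch64-linux-gnu/libcuda.so
          options: ["ro", "bind"]
```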

deitch commented 2 weeks ago

Two questions.

First, I had thought that the CDI was going to be in /opt/vendor/<vendor>/, so that the CDI file is in a known place, vendor-specific, and thus is already mounted into pillar? CDI files do not have to be in /etc/cdi/, it is configurable.

Second, I could not figure out where you consume the CDI files. Or are you assuming that by default, it will look in /etc/cdi/?

rene commented 2 weeks ago

Two questions.

First, I had thought that the CDI was going to be in /opt/vendor/<vendor>/, so that the CDI file is in a known place, vendor-specific, and thus is already mounted into pillar? CDI files do not have to be in /etc/cdi/, it is configurable.

  1. We need to set the CDI directory in containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml), but directories like /opt/vendor/nvidia/cdi will not be available on all platforms, which throws an error when the directory doesn't exist (see the config sketch after this list)
  2. By using /etc/cdi I can push any CDI file directly to this directory without the need to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools
  3. In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work. I didn't have time to debug it more deeply; once I figure it out, maybe a custom CDI directory can be implemented....
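For reference, the CDI knobs in containerd's CRI plugin configuration (containerd 1.7+) look roughly like the sketch below. Whether EVE's config.toml wires CDI through exactly these options is an assumption of this sketch, not something stated in this PR:

```toml
# Hypothetical excerpt of /etc/containerd/config.toml (CRI plugin, containerd 1.7+)
[plugins."io.containerd.grpc.v1.cri"]
  # enable CDI device resolution
  enable_cdi = true
  # directories scanned for CDI spec files; /etc/cdi is the conventional default
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```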

Second, I could not figure out where you consume the CDI files. Or are you assuming that by default, it will look in /etc/cdi/?

Here: https://github.com/lf-edge/eve/pull/4265/files#diff-b4bffe1f639681bb5f53d8804f2010200cc30b19a149577433ae128d00e03175R181

I just need to provide the device string; containerd's CDI plugin does the whole job of looking up the CDI files (in the directory configured in config.toml), consuming them and populating the OCI spec...
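As a rough illustration of what that consumption involves, here is a minimal Go sketch using the CNCF CDI library. The import path, the use of the library's default registry, and the device name are assumptions for illustration; this is not the code added by this PR:

```go
// Minimal sketch: resolve a fully-qualified CDI device name against the spec
// files found in the default CDI directories (e.g. /etc/cdi) and apply the
// resulting edits to an OCI runtime spec. Illustrative only.
package main

import (
	"fmt"

	"github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// In pillar this would be the container's OCI spec being built.
	spec := &specs.Spec{Linux: &specs.Linux{}}

	// InjectDevices matches "vendor.com/class=name" against the loaded CDI
	// specs and adds the device nodes, mounts, env vars and hooks they define.
	unresolved, err := cdi.GetRegistry().InjectDevices(spec, "nvidia.com/orin-gpu=all")
	if err != nil {
		fmt.Printf("CDI injection failed, unresolved: %v, err: %v\n", unresolved, err)
		return
	}
	fmt.Printf("OCI spec now carries %d device nodes\n", len(spec.Linux.Devices))
}
```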

deitch commented 2 weeks ago

containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml)

Isn't that system containerd, and we want this for user containerd? Although I spent time with Paul earlier today, which showed that user apps are running on system containerd? I lost track. 🤷‍♂️

By using /etc/cdi I can push any CDI file directly to this directory without the need to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools

So what would be the process for a specific vendor? Root filesystem is immutable, so /etc/cdi (which is mounted into the pillar container) should be immutable. How do I get my CDI files in there? Or is the assumption that they are part of the rootfs build? How would that work?

We do have the pillar /opt/vendor/*/init.sh process, where we could just link /etc/cdi -> /opt/vendor/<foo>/cdi/? I am not sure that is better, though. I always have a concern with modifying things on the root filesystem and mounting it in.

In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work

Different containerd? 🤷‍♂️

rene commented 2 weeks ago

containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml)

Isn't that system containerd, and we want this for user containerd? Although I spent time with @christoph-zededa earlier today, which showed that user apps are running on system containerd? I lost track. 🤷‍♂️

No, user containerd is used only for CAS, user apps run on system's containerd....

By using /etc/cdi I can push any CDI file directly to this directory without the need to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools

So what would be the process for a specific vendor? Root filesystem is immutable, so /etc/cdi (which is mounted into the pillar container) should be immutable. How do I get my CDI files in there? Or is the assumption that they are part of the rootfs build? How would that work?

We do have the pillar /opt/vendor/*/init.sh process, where we could just link /etc/cdi -> /opt/vendor/<foo>/cdi/? I am not sure that is better, though. I always have a concern with modifying things on the root filesystem and mounting it in.

Me too, I have a security concern about allowing runtime CDI files, so the assumption is that any CDI file must be part of the rootfs build, which IMO makes sense since they are usually very hardware-specific and will be provided by specific packages, such as pkg/nvidia....

In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work

Different containerd? 🤷‍♂️

Unfortunately not, the containerd config is correct, it's the system's containerd....

deitch commented 2 weeks ago

No, user containerd is used only for CAS, user apps run on system's containerd

Yeah, that came up yesterday. We probably should change that, but well beyond the scope of this PR.

Me too, I have a security concern about allowing runtime CDI files, so the assumption is that any CDI file must be part of the rootfs build, which IMO makes sense since they are usually very hardware-specific and will be provided by specific packages, such as pkg/nvidia....

What happens if you have 100 different devices, all of the same family, with slightly different CDI? Are you going to have 100 different rootfs builds? Or just one, with multiple CDIs, and the ability to detect each? This gets us very much down the road of different builds because of a single few KB config file in /etc/.

Unfortunately not, the containerd config is correct, it's the system's containerd

So, we mount /etc/cdi into pillar, just so that we can retrieve the devices and modify the container spec, but in the end that gets passed to system containerd anyways, which runs (by definition) outside of pillar?

I didn't quite get what we are doing with that chunk of code inside pillar. We inject the devices into the container spec based on the name. Essentially, we are duplicating what containerd normally does?

rene commented 2 weeks ago

What happens if you have 100 different devices, all of the same family, with slightly different CDI? Are you going to have 100 different rootfs builds? Or just one, with multiple CDIs, and the ability to detect each? This gets us very much down the road of different builds because of a single few KB config file in /etc/.

It's totally fine to have multiple CDI files under /etc/cdi, we don't need a separate build per device. Actually, that's how it's working for NVIDIA, we have CDI files for both Xavier + Orin boards on the same build. Inside each file, we use different names for device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin...

Unfortunately not, the containerd config is correct, it's the system's containerd

So, we mount /etc/cdi into pillar, just so that we can retrieve the devices and modify the container spec, but in the end that gets passed to system containerd anyways, which runs (by definition) outside of pillar?

Yes, this can be improved when we move execution of Edge Apps to user containerd...

I didn't quite get what we are doing with that chunk of code inside pillar. We inject the devices into the container spec based on the name. Essentially, we are duplicating what containerd normally does?

No, this parses the I/O adapters list from the device model. The way we give direct access to the GPU is exactly the same as for passthrough PCI devices; the difference is that instead of giving a PCIe Bus Address in the device model, we pass the CDI string for the particular device. CDI is only used for GPU access for now; Serial devices and any other devices (like a webcam under /dev/video0) are read from the device model and added to the OCI spec. For standard containers (with ShimVM) we just give full access to all file devices. In the documentation being added you can see some examples and a better description: https://github.com/lf-edge/eve/pull/4265/files#diff-8230cff5878b3df207474c79828836840673a5ac49fdc808ba034809902cac96

deitch commented 2 weeks ago

It's totally fine to have multiple CDI files under /etc/cdi, we don't need a separate build per device. Actually, that's how it's working for NVIDIA, we have CDI files for both Xavier + Orin boards on the same build. Inside each file, we use different names for device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin...

OK, that works, thanks.

No, this parses the I/O adapters list from the device model. The way we give direct access to the GPU is exactly the same as for passthrough PCI devices; the difference is that instead of giving a PCIe Bus Address in the device model, we pass the CDI string for the particular device. CDI is only used for GPU access for now; Serial devices and any other devices (like a webcam under /dev/video0) are read from the device model and added to the OCI spec. For standard containers (with ShimVM) we just give full access to all file devices. In the documentation being added you can see some examples and a better description

So this is about translating between what came in the EVE API request for devices to pass to the app, and the CDI file format, so that it will know what to do with it? That would make sense to me, but the doc to which you linked implies that the spec comes with the CDI attribute? So why translate?

rene commented 2 weeks ago

So this is about translating between what came in the EVE API request for devices to pass to the app, and the CDI file format, so that it will know what to do with it? That would make sense to me, but the doc to which you linked implies that the spec comes with the CDI attribute? So why translate?

I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe Bus Addresses, network interface names (eth0, wlan0), etc., in the hardware device model; for this particular case we specify the CDI string, which points to the device described in the CDI file.... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example....

deitch commented 2 weeks ago

Actually, that's how it's working for NVIDIA, we have CDI files for both Xavier + Orin boards on the same build. Inside each file, we use different names for device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin..

I was just thinking about this. How do you distinguish between them? Aren't the device names similar? Won't you have conflicts?

deitch commented 2 weeks ago

I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe Bus Addresses, network interface names (eth0, wlan0), etc., in the hardware device model; for this particular case we specify the CDI string, which points to the device described in the CDI file.... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example

What I meant was, if we already define the CDI string inside the app instance, e.g. nvidia.com/gpus=0, then that is how containerd will take it and build the right config.json OCI spec for it, so why does pillar need to have access to it? What is that chunk of code doing?

rene commented 2 weeks ago

Actually, that's how it's working for NVIDIA, we have CDI files for both Xavier + Orin boards on the same build. Inside each file, we use different names for device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin..

I was just thinking about this. How do you distinguish between them? Aren't the device names similar? Won't you have conflicts?

We can name these devices whatever we want; originally nvidia-ctk will always use "nvidia.com/gpu" during the CDI generation, I just changed them to nvidia.com/xavier-gpu and nvidia.com/orin-gpu
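To make that concrete, the two spec files can coexist under /etc/cdi because each carries its own kind. Only the two kind values come from the build described above; the file names and the device node below are illustrative:

```yaml
# /etc/cdi/nvidia-xavier.yaml (illustrative file name)
cdiVersion: "0.6.0"
kind: nvidia.com/xavier-gpu
devices:
  - name: all                    # referenced as nvidia.com/xavier-gpu=all
    containerEdits:
      deviceNodes:
        - path: /dev/nvhost-ctrl

# /etc/cdi/nvidia-orin.yaml (illustrative file name)
cdiVersion: "0.6.0"
kind: nvidia.com/orin-gpu
devices:
  - name: all                    # referenced as nvidia.com/orin-gpu=all
    containerEdits:
      deviceNodes:
        - path: /dev/nvhost-ctrl
```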

rene commented 2 weeks ago

I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe Bus Addresses, network interface names (eth0, wlan0), etc., in the hardware device model; for this particular case we specify the CDI string, which points to the device described in the CDI file.... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example

What I meant was, if we already define the CDI string inside the app instance, e.g. nvidia.com/gpus=0, then that is how containerd will take it and build the right config.json OCI spec for it, so why does pillar need to have access to it? What is that chunk of code doing?

We don't define the CDI string inside the Edge App; the Edge App should be like any regular Edge App, the only requirement is to have NO_HYPER as the virtualization mode. Then we are going to pass through a GPU I/O adapter to this Edge App, like any regular PCI passthrough... the trick happens when we parse the I/O adapters from the device model and find the "cdi" attribute under "cbattr", so for native containers (and only for native containers) we will use this string as a CDI device and process it accordingly....

This approach makes the CDI solution 100% compatible with the current passthrough mechanism, and it requires no changes to either the API or the controller side....
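To illustrate the mechanism, a GPU entry in a device model could carry the CDI name roughly like the sketch below. Only the "cbattr"/"cdi" attribute comes from this thread; the surrounding field names, the "=all" device name and the omission of other required fields are assumptions for illustration:

```json
{
  "ioMemberList": [
    {
      "phylabel": "GPU",
      "logicallabel": "GPU",
      "assigngrp": "gpu",
      "cbattr": {
        "cdi": "nvidia.com/orin-gpu=all"
      }
    }
  ]
}
```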

deitch commented 2 weeks ago

Ok, now that makes sense. So there still is a "translation" going on between "how GPU appears in EVE API" and "how GPU is listed in CDI files". The work in pillar is there to do that translation. Correct? Can we capture that in the docs?

rene commented 2 weeks ago

Ok, now that makes sense. So there still is a "translation" going on between "how GPU appears in EVE API" and "how GPU is listed in CDI files". The work in pillar is there to do that translation. Correct? Can we capture that in the docs?

Correct. Ok, I will update the documentation...

rene commented 2 weeks ago

Updates in this PR: