Open aojea opened 10 months ago
/cc @samuelkarp
Could you explain why CNI can't be extended to support your use case?
I also wonder if OCI hooks can be used.
Could you explain why CNI can't be extended to support your use case?
CNI is about network interface creation and configuration: https://github.com/containernetworking/cni/blob/main/SPEC.md#cni-operations
ADD: Add container to network, or apply modifications. A CNI plugin, upon receiving an ADD command, should either create the interface defined by CNI_IFNAME inside the container at CNI_NETNS, or adjust the configuration of the interface defined by CNI_IFNAME inside the container at CNI_NETNS.
CNI is also an implementation detail of container runtimes and has some limitations. In Kubernetes, projects use annotations and different out-of-band methods to pass this additional information for other interfaces; more on https://github.com/containernetworking/cni/issues/891
In Kubernetes, Pods use devices at the container level, and this maps to the OCI specification.
I think most of the problems in this area arise because we conflate network devices and network configuration. My proposal is to decouple them by adding a new field, netDevice, at the Pod level that maps to the OCI specification. IMHO this will elegantly solve the Pod and container multi-interface problem, leaving the configuration of these netDevices to the CNI, the user app, or the network plugins.
I also wonder if OCI hooks can be used.
I'm not well versed in this area. I had this conversation with @samuelkarp, and he thought it was worth at least opening this debate.
I also wonder if OCI hooks can be used.
I think they can. But it moves control from a declarative model (like the rest of the OCI spec) to imperative via the hook implementation. If the goal for the runtime spec is to allow a bundle author to specify the attributes of the container and for a runtime (such as runc) to implement, I do think it'd be nice to include some aspects of networking in that as well.
However, networking is fairly complex. @aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?
@aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?
just interface moves, being able to reference any netDevice in the host to move into the container network namespace
After spending a few weeks exploring different options, I can see how, with the new patterns enabled by CDI (https://github.com/cncf-tags/container-device-interface), Kubernetes and all container environments can benefit from instructing runtimes to move a specific netdevice, by name, into the runtime namespace. @elezar WDYT?
Right now you have to do an exotic dance between annotations and out-of-band operations just to get the information to the CNI plugin so that it can move one interface into the network namespace. If container runtimes could declaratively move the netdevice specified by name into the network namespace, everything would be much simpler.
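To make the "declarative move by name" idea concrete, here is a minimal Go sketch. It assumes a hypothetical `netDevices` map keyed by the host interface name, with an optional new name inside the container. The field names, the `NetDevice` and `LinuxSection` types, and the `parseNetDevices` helper are all illustrative assumptions, not taken from the spec or from the linked proposal. The sketch only decodes the fragment and lists the moves a runtime would be asked to perform; it does not touch any network namespace.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// NetDevice is a hypothetical description of a host interface to move.
// Name is the optional name the interface gets inside the container.
type NetDevice struct {
	Name string `json:"name,omitempty"`
}

// LinuxSection is a hypothetical subset of a bundle's Linux configuration.
// The "netDevices" key is an assumption made for this sketch.
type LinuxSection struct {
	NetDevices map[string]NetDevice `json:"netDevices,omitempty"`
}

// parseNetDevices decodes the fragment and returns "host -> container"
// pairs, sorted by host interface name so the result is deterministic.
func parseNetDevices(raw []byte) ([]string, error) {
	var l LinuxSection
	if err := json.Unmarshal(raw, &l); err != nil {
		return nil, err
	}
	moves := make([]string, 0, len(l.NetDevices))
	for host, dev := range l.NetDevices {
		target := dev.Name
		if target == "" {
			target = host // keep the host name when no rename is requested
		}
		moves = append(moves, host+" -> "+target)
	}
	sort.Strings(moves)
	return moves, nil
}

func main() {
	raw := []byte(`{"netDevices": {"eth1": {"name": "net0"}, "eth2": {}}}`)
	moves, err := parseNetDevices(raw)
	if err != nil {
		panic(err)
	}
	for _, m := range moves {
		fmt.Println(m)
	}
}
```

The point of the sketch is that the bundle only names existing interfaces; creation, addressing, and routing stay out of it.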
My main use case is to model GPUs and their relation to the high-speed NICs used for GPUDirect.
        GPU0    GPU1    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0    X       SYS     NODE    NODE    SYS     SYS     0,2,4,6,8,10    0
GPU1    SYS     X       SYS     SYS     PHB     PHB     1,3,5,7,9,11    1
mlx5_0  NODE    SYS     X       PIX     SYS     SYS
mlx5_1  NODE    SYS     PIX     X       SYS     SYS
mlx5_2  SYS     PHB     SYS     SYS     X       PIX
mlx5_3  SYS     PHB     SYS     SYS     PIX     X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
It is complex to model this relationship in systems like Kubernetes, since NICs are traditionally treated as part of the CNI. In this case, however, the NICs are only netdevices associated with the GPUs; they are consumed directly by the GPU, not by the Kubernetes cluster or its users.
If the OCI spec supports "netdevices", mechanisms like CDI (https://github.com/cncf-tags/container-device-interface) can mutate the OCI spec and add them to the Pod in a declarative way. A user can create a Pod or a container requesting one or multiple GPUs, and the CDI driver can mutate the OCI spec to add the associated NICs/netdevices, without the user having to do manual plumbing that is error prone. Device drivers can always check the Node topology and assign the best NIC or NICs for each case.
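As a rough illustration of "check the Node topology and assign the best NIC or NICs", the Go sketch below hard-codes the GPU rows of the topology matrix above and picks, for each GPU, the NICs with the closest connection type. The ranking PIX < PXB < PHB < NODE < SYS is my reading of the legend (fewest hops first), and the `linkCost`, `topology`, and `closestNICs` names are hypothetical; a real driver would discover the topology rather than embed it.

```go
package main

import (
	"fmt"
	"sort"
)

// linkCost ranks the topology labels from closest to farthest.
// The ordering is an assumption derived from the legend above.
var linkCost = map[string]int{"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

// topology holds GPU -> NIC -> connection label, copied from the matrix.
var topology = map[string]map[string]string{
	"GPU0": {"mlx5_0": "NODE", "mlx5_1": "NODE", "mlx5_2": "SYS", "mlx5_3": "SYS"},
	"GPU1": {"mlx5_0": "SYS", "mlx5_1": "SYS", "mlx5_2": "PHB", "mlx5_3": "PHB"},
}

// closestNICs returns every NIC that shares the lowest-cost connection
// to the given GPU, sorted by name. Unknown labels are skipped.
func closestNICs(gpu string) []string {
	best := -1
	var nics []string
	for nic, label := range topology[gpu] {
		cost, ok := linkCost[label]
		if !ok {
			continue
		}
		switch {
		case best == -1 || cost < best:
			best, nics = cost, []string{nic}
		case cost == best:
			nics = append(nics, nic)
		}
	}
	sort.Strings(nics)
	return nics
}

func main() {
	for _, gpu := range []string{"GPU0", "GPU1"} {
		fmt.Println(gpu, closestNICs(gpu))
	}
}
```

With the matrix above, GPU0 pairs with mlx5_0 and mlx5_1 (NODE) and GPU1 pairs with mlx5_2 and mlx5_3 (PHB), which is the association a driver could then express as netdevices in the OCI spec.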
cc: @klueska
Runc already has the concept of network device and how to configure it, in addition to the CNI specification
runc's Network type (part of libcontainer) does not seem to be used by the code related to bundle parsing; it appears (from git blame) to be from January & February 2015, before the OCI was established (and possibly before libcontainer was even factored out of Docker). It does have a fairly decent number of parameters, though it appears to be focused on interface creation (new loopback or veth pair) rather than moves.
Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?
Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?
The latter. Network type and network configuration are just what I want to avoid; that area is unbounded and contentious. On the other hand, moving host interfaces into container namespaces is IMHO well defined and solves important use cases very easily. My reasoning is that, just as block devices are moved into the container namespace, network devices can be moved "declaratively" too. It should be possible to define some of the properties of struct ifreq, such as name and address, but I would really like to avoid any dynamic configuration à la CNI; that should be solved at another layer.
The case where some devices have both an RDMA device and a netdevice is especially interesting. This will solve that problem really well, instead of having to split responsibility between the OCI runtime for the RDMA device and the CNI for the netdevice, which is always going to be racy.
@aojea this might be worth running this by Kata or the other virtualized runtimes.
@aojea this might be worth running this by Kata or the other virtualized runtimes.
are those implementing the OCI runtime spec?
@aojea this might be worth running this by Kata or the other virtualized runtimes.
are those implementing the OCI runtime spec?
Yes, the communication between the runtimes is via OCI. The CreateTask api in containerd uses the runtime oci spec to communicate with the lower level runtimes. Unless something has changed :-P @mikebrow keep me honest ha!
The CreateTask api in containerd uses the runtime oci spec to communicate with the lower level runtimes.
Then it is unrelated; in this case it is containerd that implements the OCI spec.
containerd/kata both implement the oci spec
containerd/kata both implement the oci spec
https://github.com/kata-containers/kata-containers/blob/main/docs/Limitations.md
I have seen that link; however, I'm sure that is why the shim may exist? The high-level runtime uses the OCI spec to communicate with the low-level runtime.
Just quickly poking through the code.
I would need to trace it down in this runtime but this is what allows us to swap out runc for kata rather easily.
I see value in passing a list of netDevices to the OCI runtimes; however, I would rather have the CNI plugins create/move the netdevs to the appropriate location. While this helps the Windows container networking stack, I would want to know whether alignment exists in Kata and other OCI runtimes as well. It seems like we have a disconnect here with regard to networking in the virtualized OCI runtimes.
Creating and moving netdevs is out of scope on purpose. This is a 1-to-1 mapping to the "block devices" API and functionality: you have /dev/gpu1 or /dev/sound0 or similar, and you can reference them and move them into a container. In this case the OS does not represent netdevices as files (see description) but allows userspace to reference them and change their properties, so I'm proposing to provide the same functionality.
The spec describes Devices that are container based, but there is another class of devices, Network Devices, that are defined per namespace, quoting "Linux Device Drivers, Second Edition, Chapter 14. Network Drivers"
Network Devices are also used to provide connectivity to network namespaces, and container runtimes commonly use the CNI specification to provide this capability of adding a network device to the namespace and configuring its networking parameters.
Runc already has the concept of a network device and how to configure it, in addition to the CNI specification: https://github.com/opencontainers/runc/tree/main/libcontainer
https://github.com/opencontainers/runc/blob/main/libcontainer/configs/network.go#L3-L51
The spec already has a reference to the network in https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#network , which references network devices but does not allow specifying the network devices that will be part of the namespace.
However, there are cases where a Kubernetes Pod or container may want to add, in a declarative way, existing Network Devices to the namespace. It is important to mention that Network Device configuration or creation is a non-goal and is left out of the spec on purpose.
The use cases for adding network devices to namespaces have become more common lately with the new AI accelerator devices that are presented to the system as network devices but are not really considered usual network devices. Ref: https://lwn.net/Articles/955001/ (Available Jan 4th without subscription)
The proposal is to be able to add existing Network devices to a linux namespace by referencing them https://docs.kernel.org/networking/netdevices.html, in a similar way to the existing definition of Devices
Linux defines a structure like this one in https://man7.org/linux/man-pages/man7/netdevice.7.html
though we only need the index or the name to be able to reference one interface
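Since only the name or the index is needed to reference an interface, a reference type can stay very small. The Go sketch below is a hypothetical illustration, not the structure from the proposal: `NetDeviceRef`, `validName`, and `Validate` are invented names, and the name checks assume the usual Linux rules as I understand them (shorter than IFNAMSIZ, which is 16 bytes including the terminator, no '/', ':' or whitespace, and not "." or "..").

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ifNameSize mirrors IFNAMSIZ: 16 bytes including the trailing NUL,
// so a usable name has at most 15 bytes.
const ifNameSize = 16

// NetDeviceRef is a hypothetical reference to an existing host
// interface, identified by exactly one of Name or Index.
type NetDeviceRef struct {
	Name  string
	Index int
}

// validName applies the naming rules assumed in the lead-in.
func validName(name string) bool {
	if name == "" || name == "." || name == ".." {
		return false
	}
	if len(name) >= ifNameSize {
		return false
	}
	return !strings.ContainsAny(name, "/: \t\n\v\f\r")
}

// Validate checks that the reference is unambiguous and well formed.
// It does not check that the interface exists on the host.
func (r NetDeviceRef) Validate() error {
	if r.Index < 0 {
		return errors.New("index must not be negative")
	}
	hasName, hasIndex := r.Name != "", r.Index > 0
	switch {
	case hasName == hasIndex:
		return errors.New("exactly one of name or index must be set")
	case hasName && !validName(r.Name):
		return fmt.Errorf("invalid interface name %q", r.Name)
	}
	return nil
}

func main() {
	refs := []NetDeviceRef{
		{Name: "eth1"},
		{Index: 3},
		{Name: "eth1", Index: 3},
		{Name: "bad/name"},
	}
	for _, r := range refs {
		fmt.Printf("%+v: %v\n", r, r.Validate())
	}
}
```

Keeping the reference this narrow matches the stated non-goal: nothing here describes addresses, routes, or how the interface was created.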
Proposal: https://github.com/opencontainers/runtime-spec/pull/1240
runc prototype: https://github.com/opencontainers/runc/compare/main...aojea:runc:netdevices?expand=1
References: