opencontainers / runtime-spec

OCI Runtime Specification
http://www.opencontainers.org
Apache License 2.0

Proposal: Network Devices #1239

Open aojea opened 9 months ago

aojea commented 9 months ago

The spec describes Devices that are container based, but there is another class of devices, network devices, that are defined per network namespace. Quoting "Linux Device Drivers, Second Edition", Chapter 14, "Network Drivers":

Chapter 14. Network Drivers

We are now through discussing char and block drivers and are ready to move on to the fascinating world of networking. Network interfaces are the third standard class of Linux devices, and this chapter describes how they interact with the rest of the kernel.

The role of a network interface within the system is similar to that of a mounted block device. A block device registers its features in the blk_dev array and other kernel structures, and it then “transmits” and “receives” blocks on request, by means of its request function. Similarly, a network interface must register itself in specific data structures in order to be invoked when packets are exchanged with the outside world.

There are a few important differences between mounted disks and packet-delivery interfaces. To begin with, a disk exists as a special file in the /dev directory, whereas a network interface has no such entry point. The normal file operations (read, write, and so on) do not make sense when applied to network interfaces, so it is not possible to apply the Unix “everything is a file” approach to them. Thus, network interfaces exist in their own namespace and export a different set of operations.

Although you may object that applications use the read and write system calls when using sockets, those calls act on a software object that is distinct from the interface. Several hundred sockets can be multiplexed on the same physical interface.

But the most important difference between the two is that block drivers operate only in response to requests from the kernel, whereas network drivers receive packets asynchronously from the outside. Thus, while a block driver is asked to send a buffer toward the kernel, the network device asks to push incoming packets toward the kernel. The kernel interface for network drivers is designed for this different mode of operation.

Network drivers also have to be prepared to support a number of administrative tasks, such as setting addresses, modifying transmission parameters, and maintaining traffic and error statistics. The API for network drivers reflects this need, and thus looks somewhat different from the interfaces we have seen so far.

Network devices are also used to provide connectivity to network namespaces, and container runtimes commonly use the CNI specification to add a network device to the namespace and configure its networking parameters.

Runc already has the concept of a network device and how to configure it, in addition to the CNI specification: https://github.com/opencontainers/runc/tree/main/libcontainer

```go
package configs

// Network defines configuration for a container's networking stack
//
// The network configuration can be omitted from a container causing the
// container to be setup with the host's networking stack
type Network struct {
	// ...
}
```

https://github.com/opencontainers/runc/blob/main/libcontainer/configs/network.go#L3-L51

The spec already references the network namespace in https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#network, which mentions network devices, but it does not allow specifying the network devices that will be part of the namespace.

However, there are cases where a Kubernetes Pod or container may want to add existing network devices to the namespace in a declarative way. It is important to mention that network device configuration or creation is a non-goal and is left out of the spec on purpose.

The use cases for adding network devices to namespaces have become more common lately with the new AI accelerator devices that are presented to the system as network devices, even though they are not used as ordinary network interfaces. Ref: https://lwn.net/Articles/955001/ (available Jan 4th without subscription)

The proposal is to be able to add existing network devices (https://docs.kernel.org/networking/netdevices.html) to a Linux network namespace by referencing them, in a similar way to the existing definition of Devices.

Linux defines a structure like this one in https://man7.org/linux/man-pages/man7/netdevice.7.html:

       This man page describes the sockets interface which is used to
       configure network devices.

       Linux supports some standard ioctls to configure network devices.
       They can be used on any socket's file descriptor regardless of
       the family or type.  Most of them pass an ifreq structure:

           struct ifreq {
               char ifr_name[IFNAMSIZ]; /* Interface name */
               union {
                   struct sockaddr ifr_addr;
                   struct sockaddr ifr_dstaddr;
                   struct sockaddr ifr_broadaddr;
                   struct sockaddr ifr_netmask;
                   struct sockaddr ifr_hwaddr;
                   short           ifr_flags;
                   int             ifr_ifindex;
                   int             ifr_metric;
                   int             ifr_mtu;
                   struct ifmap    ifr_map;
                   char            ifr_slave[IFNAMSIZ];
                   char            ifr_newname[IFNAMSIZ];
                   char           *ifr_data;
               };
           };

though we only need the index or the name to reference an interface:

   Normally, the user specifies which device to affect by setting
   ifr_name to the name of the interface or ifr6_ifindex to the
   index of the interface.  All other members of the structure may
   share memory.
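
To make the intended operation concrete, here is a minimal sketch of an interface move as a runtime could implement it, assuming the github.com/vishvananda/netlink and github.com/vishvananda/netns Go packages; the helper name and arguments are illustrative and are not part of this proposal or of the runc prototype:

```go
// Illustrative sketch only: move an existing host network device, referenced
// by name, into a target network namespace and optionally rename it and set
// an address and MTU there. The helper name and arguments are hypothetical.
package netdev

import (
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func moveNetDevice(hostName, nsPath, newName, cidr string, mtu int) error {
	// Reference the device on the host by name (LinkByIndex would also work).
	link, err := netlink.LinkByName(hostName)
	if err != nil {
		return err
	}

	// Open the target network namespace, e.g. /proc/<pid>/ns/net.
	ns, err := netns.GetFromPath(nsPath)
	if err != nil {
		return err
	}
	defer ns.Close()

	// Move the device into the namespace.
	if err := netlink.LinkSetNsFd(link, int(ns)); err != nil {
		return err
	}

	// Further changes have to go through a netlink handle bound to the
	// target namespace, since the device is no longer visible on the host.
	handle, err := netlink.NewHandleAt(ns)
	if err != nil {
		return err
	}
	defer handle.Delete()

	nsLink, err := handle.LinkByName(hostName)
	if err != nil {
		return err
	}
	if newName != "" {
		if err := handle.LinkSetName(nsLink, newName); err != nil {
			return err
		}
	}
	if cidr != "" {
		// cidr combines the proposed address and mask, e.g. "192.168.0.1/24".
		addr, err := netlink.ParseAddr(cidr)
		if err != nil {
			return err
		}
		if err := handle.AddrAdd(nsLink, addr); err != nil {
			return err
		}
	}
	if mtu > 0 {
		if err := handle.LinkSetMTU(nsLink, mtu); err != nil {
			return err
		}
	}
	return handle.LinkSetUp(nsLink)
}
```

A runtime implementing the proposed `netDevices` entries would essentially perform something like this once per entry at container creation time, before the container process starts.
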
## <a name="configLinuxNetDevices" />NetDevices

**`netDevices`** (array of objects, OPTIONAL) lists network devices that MUST be available in the container network namespace.

Each entry has the following structure:

* **`name`** *(string, REQUIRED)* - name of the network device in the host.
* **`properties`** *(object, OPTIONAL)* - properties of the network device, per https://man7.org/linux/man-pages/man7/netdevice.7.html, to apply in the container namespace.
    It has the following structure:
    * **`name`** *(string, OPTIONAL)* - name of the network device in the network namespace.
    * **`address`** *(string, OPTIONAL)* - address of the network device in the network namespace.
    * **`mask`** *(string, OPTIONAL)* - mask of the network device in the network namespace.
    * **`mtu`** *(uint16, OPTIONAL)* - MTU size of the network device in the network namespace.

### Example

```json
"netDevices": [
    {
        "name": "eth0",
        "properties": {
            "name": "ns1",
            "address": "192.168.0.1",
            "mask": "255.255.255.0",
            "mtu": 1500
        }
    },
    {
        "name": "ens4"
    }
]
```
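
For Go tooling, the entries above could be represented roughly as follows. This is only a sketch of the shape of the data, with illustrative type and field names; it is not the actual change proposed for specs-go in the linked PR:

```go
package netdev

// NetDeviceProperties describes how the device should look inside the
// container network namespace (fields mirror the properties listed above;
// all names are hypothetical).
type NetDeviceProperties struct {
	// Name to give the device inside the namespace (optional rename).
	Name string `json:"name,omitempty"`
	// Address to assign to the device inside the namespace.
	Address string `json:"address,omitempty"`
	// Mask for the address above.
	Mask string `json:"mask,omitempty"`
	// MTU to set on the device inside the namespace.
	MTU uint16 `json:"mtu,omitempty"`
}

// NetDevice references an existing host network device by name.
type NetDevice struct {
	// Name of the device on the host.
	Name string `json:"name"`
	// Properties to apply once the device is in the container namespace.
	Properties *NetDeviceProperties `json:"properties,omitempty"`
}
```
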

Proposal: https://github.com/opencontainers/runtime-spec/pull/1240
runc prototype: https://github.com/opencontainers/runc/compare/main...aojea:runc:netdevices?expand=1

aojea commented 9 months ago

/cc @samuelkarp

AkihiroSuda commented 9 months ago

Could you explain why CNI can't be extended to support your use case?

I also wonder if OCI hooks can be used.

aojea commented 9 months ago

Could you explain why CNI can't be extended to support your use case?

CNI is about network interface creation and configuration: https://github.com/containernetworking/cni/blob/main/SPEC.md#cni-operations

ADD: Add container to network, or apply modifications

A CNI plugin, upon receiving an ADD command, should either create the interface defined by CNI_IFNAME inside the container at CNI_NETNS, or adjust the configuration of the interface defined by CNI_IFNAME inside the container at CNI_NETNS.

CNI is also an implementation detail of container runtimes and has some limitations; Kubernetes projects use annotations and different out-of-band methods to pass this additional information for other interfaces, more on containernetworking/cni#891

In Kubernetes, Pods use devices at the container level, and that maps to the OCI specification.

I think that most of the problems in this area come from conflating the network device and the network configuration. My proposal is to decouple them: add a new netDevice field to Pods at the Pod level that maps to the OCI specification. IMHO this solves the Pod and container multi-interface problem elegantly, leaving the configuration of these netDevices to the CNI, the user app, or the network plugins.

I also wonder if OCI hooks can be used.

I'm not well versed in this area; I had this conversation with @samuelkarp, and he thought it was worth at least opening this debate.

samuelkarp commented 9 months ago

I also wonder if OCI hooks can be used.

I think they can. But it moves control from a declarative model (like the rest of the OCI spec) to an imperative one via the hook implementation. If the goal for the runtime spec is to allow a bundle author to specify the attributes of the container and for a runtime (such as runc) to implement them, I do think it'd be nice to include some aspects of networking in that as well.

However, networking is fairly complex. @aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?

aojea commented 9 months ago

@aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?

Just interface moves: being able to reference any netDevice on the host and move it into the container network namespace.

aojea commented 8 months ago

After spending a few weeks exploring different options, I can see how all these new patterns enabled by CDI (https://github.com/cncf-tags/container-device-interface) can benefit Kubernetes and all container environments by instructing runtimes to move a specific netdevice by name into the runtime namespace. @elezar WDYT?

Right now you have to do an exotic dance between annotations and out-of-band operations just to get the information to the CNI plugin so it can move one interface into the network namespace. If the container runtimes can declaratively move a netdevice specified by name into the network namespace, everything will be much simpler.

aojea commented 7 months ago

My main use case is to model GPUs and their relation to the high-speed NICs used for GPUDirect.

```
        GPU0    GPU1    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      SYS     NODE    NODE    SYS     SYS     0,2,4,6,8,10    0
GPU1    SYS      X      SYS     SYS     PHB     PHB     1,3,5,7,9,11    1
mlx5_0  NODE    SYS      X      PIX     SYS     SYS
mlx5_1  NODE    SYS     PIX      X      SYS     SYS
mlx5_2  SYS     PHB     SYS     SYS      X      PIX
mlx5_3  SYS     PHB     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

It is complex to model this relation in systems like Kubernetes, since traditionally NICs are treated as part of the CNI; but in this case the NICs are just netdevices associated with the GPUs, consumed directly by the GPUs and not by the Kubernetes cluster or its users.

If the OCI spec supports "netdevices", it is possible to use mechanisms like CDI (https://github.com/cncf-tags/container-device-interface) to mutate the OCI spec and add this bundle in a declarative way to the Pod. A user can create a Pod or a container requesting one or multiple GPUs, and the CDI driver can mutate the OCI spec to add the associated NICs/netdevices, without the users having to do error-prone manual plumbing; device drivers can always check the node topology and assign the best NIC or NICs for each case.
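
As a rough illustration of that flow, and reusing the hypothetical NetDevice type sketched earlier, a topology-aware CDI-style driver could apply an edit along these lines. The config wrapper, the lookup table, and the NetDevices field are all assumptions for the sake of the example, not existing CDI or runtime-spec APIs:

```go
package netdev

// linuxConfig stands in for the Linux section of an OCI runtime config,
// extended with the proposed (not yet existing) NetDevices field.
type linuxConfig struct {
	NetDevices []NetDevice `json:"netDevices,omitempty"`
}

// nicForGPU is a placeholder for topology-aware NIC selection, e.g. picking
// the mlx5 device closest to the GPU in the matrix above.
func nicForGPU(gpu string) string {
	// A real driver would consult sysfs/NUMA topology instead of a table.
	mapping := map[string]string{
		"GPU0": "mlx5_0",
		"GPU1": "mlx5_2",
	}
	return mapping[gpu]
}

// addGPUNetDevices appends the NICs associated with the requested GPUs to
// the container config: the kind of declarative edit a CDI driver could
// apply instead of annotations and out-of-band CNI plumbing.
func addGPUNetDevices(cfg *linuxConfig, gpus []string) {
	for _, gpu := range gpus {
		if nic := nicForGPU(gpu); nic != "" {
			cfg.NetDevices = append(cfg.NetDevices, NetDevice{Name: nic})
		}
	}
}
```
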

cc: @klueska

wojtek-t commented 6 months ago

/cc

zshi-redhat commented 6 months ago

/cc

samuelkarp commented 5 months ago

Runc already has the concept of a network device and how to configure it, in addition to the CNI specification

runc's Network type (part of libcontainer) does not seem to be used by the code related to bundle parsing; it appears (from git blame) to be from January & February 2015, before the OCI was established (and possibly before libcontainer was even factored out of Docker). It does have a fairly decent number of parameters, though it appears to be focused on interface creation (new loopback or veth pair) rather than moves.

Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?

runc `Network` type:

```go
type Network struct {
	// Type sets the networks type, commonly veth and loopback
	Type string `json:"type"`

	// Name of the network interface
	Name string `json:"name"`

	// The bridge to use.
	Bridge string `json:"bridge"`

	// MacAddress contains the MAC address to set on the network interface
	MacAddress string `json:"mac_address"`

	// Address contains the IPv4 and mask to set on the network interface
	Address string `json:"address"`

	// Gateway sets the gateway address that is used as the default for the interface
	Gateway string `json:"gateway"`

	// IPv6Address contains the IPv6 and mask to set on the network interface
	IPv6Address string `json:"ipv6_address"`

	// IPv6Gateway sets the ipv6 gateway address that is used as the default for the interface
	IPv6Gateway string `json:"ipv6_gateway"`

	// Mtu sets the mtu value for the interface and will be mirrored on both the host and
	// container's interfaces if a pair is created, specifically in the case of type veth
	// Note: This does not apply to loopback interfaces.
	Mtu int `json:"mtu"`

	// TxQueueLen sets the tx_queuelen value for the interface and will be mirrored on both the host and
	// container's interfaces if a pair is created, specifically in the case of type veth
	// Note: This does not apply to loopback interfaces.
	TxQueueLen int `json:"txqueuelen"`

	// HostInterfaceName is a unique name of a veth pair that resides on in the host interface of the
	// container.
	HostInterfaceName string `json:"host_interface_name"`

	// HairpinMode specifies if hairpin NAT should be enabled on the virtual interface
	// bridge port in the case of type veth
	// Note: This is unsupported on some systems.
	// Note: This does not apply to loopback interfaces.
	HairpinMode bool `json:"hairpin_mode"`
}
```
aojea commented 5 months ago

Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?

The latter. Network type and network configuration is just what I want to avoid; it is unbounded and contentious ... On the other side, moving host interfaces to container namespaces is IMHO well defined and solves important use cases very easily. My reasoning is that, the same as block devices are moved into the container namespace, network devices can be moved "declaratively" too. It should be possible to define some of the properties of struct ifreq, such as name and address, but I would really like to avoid any dynamic configuration à la CNI ... that should be solved at another layer.

Especially interesting is the case where some devices have both an RDMA device and a netdevice; this will solve that problem really well, instead of having to split the responsibility between the OCI runtime (for the RDMA device) and the CNI (for the netdevice), which is always going to be racy.

MikeZappa87 commented 1 month ago

@aojea it might be worth running this by Kata or the other virtualized runtimes.

aojea commented 1 month ago

@aojea it might be worth running this by Kata or the other virtualized runtimes.

are those implementing the OCI runtime spec?

MikeZappa87 commented 1 month ago

@aojea it might be worth running this by Kata or the other virtualized runtimes.

are those implementing the OCI runtime spec?

Yes, the communication between the runtimes is via OCI. The CreateTask API in containerd uses the runtime OCI spec to communicate with the lower-level runtimes. Unless something has changed :-P @mikebrow keep me honest ha!

aojea commented 1 month ago

The CreateTask api in containerd uses the runtime oci spec to communicate with the lower level runtimes.

Then it is unrelated; the one implementing the OCI spec is containerd in this case.

MikeZappa87 commented 1 month ago

containerd/kata both implement the oci spec

aojea commented 1 month ago

containerd/kata both implement the oci spec

https://github.com/kata-containers/kata-containers/blob/main/docs/Limitations.md

MikeZappa87 commented 1 month ago

I have seen that link; however, I'm sure that's why the shim may exist? The high-level runtime uses the OCI spec to communicate with the low-level runtime via OCI.

Just quickly poking through the code.

https://github.com/kata-containers/kata-containers/blob/f24983b3cf6a5e8bcd006a3aad718fa9b83396af/src/runtime/pkg/oci/utils.go#L1011

I would need to trace it down in this runtime but this is what allows us to swap out runc for kata rather easily.

MikeZappa87 commented 1 month ago

I see value in passing a list of netDevices to the OCI runtimes; however, I would rather have the CNI plugins create/move the netdevs to the appropriate location. While this helps the Windows container networking stack, I would want to know if alignment exists in Kata and other OCI runtimes as well. It seems like we have a disconnect here with regard to the virtualized OCI runtimes for networking.

aojea commented 1 month ago

Creating netdevs is on purpose out of the scope; this is a 1-to-1 mapping to the "block devices" API and functionality, so you have /dev/gpu1 or /dev/sound0 or similar and you can reference them and move them into a container. In this case the OS does not represent netdevices as files (see the description), but it allows userspace to reference them and change their properties, so I'm proposing to provide the same functionality.