k8snetworkplumbingwg / network-resources-injector

A Kubernetes Dynamic Admission Controller that patches Pods to add additional information.
Apache License 2.0

Support for user configurable resource name keys #18

Closed pperiyasamy closed 3 years ago

pperiyasamy commented 4 years ago

The network-resources-injector currently supports only one resource (per network), defined with the resource name key k8s.v1.cni.cncf.io/resourceName, which makes it difficult to request two distinct resources in the same network attachment definition object. Please refer to the ovs-cni issue for more details. Hence I would like to enhance network-resources-injector with a user-configurable network-resource-name-keys (comma-separated strings) configuration parameter. By default, when no values are provided, it keeps supporting the existing k8s.v1.cni.cncf.io/resourceName key.
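For illustration, a minimal sketch of what this could look like; the flag syntax below is only a proposed shape, and the annotation values and resource names are placeholders:

```yaml
# Proposed (not final) injector argument, e.g. in the webhook Deployment:
#   --network-resource-name-keys=k8s.v1.cni.cncf.io/resourceName,k8s.v1.cni.cncf.io/bridgeName
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-offload-net
  annotations:
    # default key, still honored when no custom keys are configured
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/cx5_sriov_switchdev
    # additional user-defined key enabled by the new parameter
    k8s.v1.cni.cncf.io/bridgeName: ovs-cni.network.kubevirt.io/br1
spec:
  config: '{ "cniVersion": "0.3.1", "type": "ovs", "bridge": "br1" }'
```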

pperiyasamy commented 4 years ago

Addressed in PR #19.

zshi-redhat commented 4 years ago

@pperiyasamy what's the use case for one net-attach-def to have two distinct resources? Does the network defined in the net-attach-def need to configure two devices for a Pod? Could that be done by attaching two net-attach-defs to the same Pod?

pperiyasamy commented 4 years ago

@zshi-redhat There is a need for two resources (a VF resource pool and an OVS bridge) in a net-attach-def object for attaching Mellanox ConnectX-5 SmartNIC VFs to a pod container. The VF resource pool is for the device plugin to allocate a VF device inside the container, and the OVS bridge name is used to attach the VF's representor netdevice to the OVS bridge, which is done by the ovs-cni plugin. The representor device is mainly used to offload VLANs, VxLANs, etc.

Hence we need to define these resources through k8s.v1.cni.cncf.io/resourceName and k8s.v1.cni.cncf.io/bridgeName in network-resources-injector for appropriate pod placement on the nodes (where both resources exist) in the k8s cluster.
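To make the intent concrete, a rough sketch of the kind of pod mutation this would allow, assuming the injector turns the value of each configured key into a container resource request/limit (all resource and image names below are placeholders):

```yaml
# Illustrative only: a Pod after mutation, if the injector honored both keys.
apiVersion: v1
kind: Pod
metadata:
  name: offload-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-offload-net
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:
        mellanox.com/cx5_sriov_switchdev: "1"   # VF from the device plugin pool
        ovs-cni.network.kubevirt.io/br1: "1"    # bridge availability on the node
      limits:
        mellanox.com/cx5_sriov_switchdev: "1"
        ovs-cni.network.kubevirt.io/br1: "1"
```

With both resources requested, the scheduler can only place the pod on nodes that advertise both the VF pool and the bridge.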

zshi-redhat commented 4 years ago

> There is a need for two resources (a VF resource pool and an OVS bridge) in a net-attach-def object for attaching Mellanox ConnectX-5 SmartNIC VFs to a pod container. The VF resource pool is for the device plugin to allocate a VF device inside the container, and the OVS bridge name is used to attach the VF's representor netdevice to the OVS bridge, which is done by the ovs-cni plugin. The representor device is mainly used to offload VLANs, VxLANs, etc.
>
> Hence we need to define these resources through k8s.v1.cni.cncf.io/resourceName and k8s.v1.cni.cncf.io/bridgeName in network-resources-injector for appropriate pod placement on the nodes (where both resources exist) in the k8s cluster.

@pperiyasamy @phoracek My understanding is that a VF representor is mapped 1:1 to a VF device, so there is essentially one network/interface attached to the Pod. Why not have ovs-cni take care of plugging both devices, the way ovn-kubernetes does?

Also adding @dougbtv, who might have more insight on standardizing the use of k8s.v1.cni.cncf.io/resourceName in NPWG. Not sure if it's allowed to have two distinct resource annotations in one net-attach-def.

phoracek commented 4 years ago

@zshi-redhat OVS CNI uses resourceName and the resource injector to make sure Pods are scheduled on Nodes where the requested bridge is available. It does not inject the bridge into the Pod; we just use this mechanism to handle scheduling. This is for vanilla OVS CNI.

With HW offload, we need to A) schedule the Pod on a Node with the bridge available and B) schedule the Pod on a Node with the VF available. Since we can request only a single resource, this is not possible with the current network-resources-injector.

At least that's my understanding of what @pperiyasamy is aiming to solve.

pperiyasamy commented 4 years ago

> @zshi-redhat OVS CNI uses resourceName and the resource injector to make sure Pods are scheduled on Nodes where the requested bridge is available. It does not inject the bridge into the Pod; we just use this mechanism to handle scheduling. This is for vanilla OVS CNI.
>
> With HW offload, we need to A) schedule the Pod on a Node with the bridge available and B) schedule the Pod on a Node with the VF available. Since we can request only a single resource, this is not possible with the current network-resources-injector.
>
> At least that's my understanding of what @pperiyasamy is aiming to solve.

@phoracek you are correct. We can also make sure that the OVS bridge is created on the node(s) where the device pool is available through a provisioning tool, and just refer to the device pool name with k8s.v1.cni.cncf.io/resourceName in the net-attach-def object. But this addition to network-resources-injector ensures everything is in place on the node to schedule a pod and helps detect any misconfiguration.

zshi-redhat commented 4 years ago

> @zshi-redhat OVS CNI uses resourceName and the resource injector to make sure Pods are scheduled on Nodes where the requested bridge is available. It does not inject the bridge into the Pod; we just use this mechanism to handle scheduling. This is for vanilla OVS CNI.

Right, OVS CNI adds the VF representor device to the OVS bridge; it doesn't attach the actual VF device to the Pod (which I assume is done by the SR-IOV CNI?). My question is: should the work of moving the actual VF into the Pod also be done by ovs-cni? Is it possible?

> With HW offload, we need to A) schedule the Pod on a Node with the bridge available and B) schedule the Pod on a Node with the VF available. Since we can request only a single resource, this is not possible with the current network-resources-injector.

If OVS and the VF pool are required to run OVS HW offload, should the user guarantee the two are available on the node during deployment or provisioning with a higher-level tool? For use of ovs-cni as the default plugin, should there already be an OVS bridge created on each node?

> At least that's my understanding of what @pperiyasamy is aiming to solve.

zshi-redhat commented 4 years ago

> @phoracek you are correct. We can also make sure that the OVS bridge is created on the node(s) where the device pool is available through a provisioning tool, and just refer to the device pool name with k8s.v1.cni.cncf.io/resourceName in the net-attach-def object.

Yes, if OVS HW offload is expected to be used in the cluster, then the OVS bridge and device pool need to be created first on the same node to avoid scheduling problems. Then it becomes an installer issue, not a network-resources-injector issue. If an OVS bridge is a limited resource in the deployment, it sounds to me like you probably need a device plugin to manage the OVS resource.

zshi-redhat commented 4 years ago

/cc @moshe010

pperiyasamy commented 4 years ago

/cc @JanScheurich

JanScheurich commented 4 years ago

>> @zshi-redhat OVS CNI uses resourceName and the resource injector to make sure Pods are scheduled on Nodes where the requested bridge is available. It does not inject the bridge into the Pod; we just use this mechanism to handle scheduling. This is for vanilla OVS CNI.
>
> Right, OVS CNI adds the VF representor device to the OVS bridge; it doesn't attach the actual VF device to the Pod (which I assume is done by the SR-IOV CNI?). My question is: should the work of moving the actual VF into the Pod also be done by ovs-cni? Is it possible?

The latest OVS CNI already attaches the actual Mellanox VF netdev to the pod. The SR-IOV CNI does not play a role here.

>> With HW offload, we need to A) schedule the Pod on a Node with the bridge available and B) schedule the Pod on a Node with the VF available. Since we can request only a single resource, this is not possible with the current network-resources-injector.
>
> If OVS and the VF pool are required to run OVS HW offload, should the user guarantee the two are available on the node during deployment or provisioning with a higher-level tool? For use of ovs-cni as the default plugin, should there already be an OVS bridge created on each node?
>
>> At least that's my understanding of what @pperiyasamy is aiming to solve.

Yes, for OVS HW offload to work, the operator needs to consistently configure an OVS bridge and the corresponding VF device pool in the SR-IOV network device plugin. I believe a higher-level operator is called for here to automate correct configuration of all the parts.

If the system is configured correctly, I doubt that for the specific use case of SmartNIC VF attachments in OVS-CNI we need more than one resource per NAD to be handled by the network resources injector. The device pool suffices.

Peri's request might be useful in other scenarios, however: for example, a hypothetical CNI that requires both a device and another, independent resource, such as an automatically allocated VLAN tag (similar to how Neutron assigns segmentation IDs to internal Neutron networks).
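A quick sketch of such a hypothetical net-attach-def; every key and resource name below is made up purely to illustrate the shape:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-tagged-net
  annotations:
    # device resource, as today
    k8s.v1.cni.cncf.io/resourceName: vendor.example.com/fast_nic_pool
    # hypothetical second, independent resource (e.g. a VLAN tag pool)
    example.org/vlanPoolName: example.org/vlan_tag_pool
```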

moshe010 commented 4 years ago

@zshi-redhat, it seems KubeVirt has [1] a marker which exposes available Open vSwitch bridges as node resources. I commented on the ovs-cni issue to better understand why it is implemented like this and not with NFD, for example.

@zshi-redhat, what are your main objections to injecting more than one resource into the same network? Is it just the use case?

[1] - https://github.com/kubevirt/ovs-cni/blob/9115a6ca16fbda50e063429f8ceb3c8afab58856/docs/marker.md
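For reference, roughly what the marker in [1] exposes on a Node; the resource name prefix and capacity value here are assumptions based on the marker docs:

```yaml
# Sketch of a Node status with one OVS bridge (br1) advertised by the marker.
status:
  capacity:
    ovs-cni.network.kubevirt.io/br1: "1000"
  allocatable:
    ovs-cni.network.kubevirt.io/br1: "1000"
```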

zshi-redhat commented 4 years ago

> @zshi-redhat, it seems KubeVirt has [1] a marker which exposes available Open vSwitch bridges as node resources. I commented on the ovs-cni issue to better understand why it is implemented like this and not with NFD, for example.
>
> @zshi-redhat, what are your main objections to injecting more than one resource into the same network? Is it just the use case?

I think this might not be a typical use of net-attach-def, since it requires two devices for one network. The major concern is that if I were to use it with an NPWG-compliant meta plugin like Multus for additional networks, Multus doesn't have the ability to read other resource annotations from a net-attach-def, as it assumes one device per network. On the other hand, I don't know if ovs-cni will be used with any meta plugin for additional networks.

pperiyasamy commented 3 years ago

PR #19 addresses this issue; closing.