k8snetworkplumbingwg / ovs-cni

Open vSwitch CNI plugin
Apache License 2.0

Rename the key for ovs-cni.network.kubevirt.io/br1 #107

Closed pperiyasamy closed 4 years ago

pperiyasamy commented 4 years ago

We have a problem with net-attach-def objects when both a device (for the device plugin) and a bridge (for ovs-cni) need to be declared as resource names. This is mainly because both are described with the same k8s.v1.cni.cncf.io/resourceName key, so the most recent value replaces the old one.

For example:

cat <<EOF | kubectl create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-conf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: ovs-cni.network.kubevirt.io/br1
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1",
      "vlan": 100
    }'
EOF

Can we have a different annotation key for defining the ovs bridge name, like below?

  annotations:
    k8s.v1.cni.cncf.io/bridgeName: ovs-cni.network.kubevirt.io/br1
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0

Then we can enhance the network resources injector to support both resources, so the pod spec gets injected with the appropriate resource requests/limits.
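For illustration, a pod using this net-attach-def could then end up with both resources in its requests/limits, roughly like the sketch below (pod name and image are placeholders, and injecting the bridge resource alongside the device resource is exactly the enhancement proposed here, not current injector behaviour):

apiVersion: v1
kind: Pod
metadata:
  name: samplepod
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-conf
spec:
  containers:
  - name: app
    image: alpine
    resources:
      requests:
        intel.com/mellanox_snic0: 1
        ovs-cni.network.kubevirt.io/br1: 1
      limits:
        intel.com/mellanox_snic0: 1
        ovs-cni.network.kubevirt.io/br1: 1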

phoracek commented 4 years ago

We cannot change the annotation name; the code that consumes the annotation (the resource injector) is not owned by us. Can you just omit the br1 resource name? It should still work as long as you have the bridge configured on all nodes that expose intel.com/mellanox_snic0.

pperiyasamy commented 4 years ago

@phoracek Yes, the ovs bridge name can be omitted from the net-attach-def, but some worker nodes may not have that ovs bridge. Hence the ovs bridge needs to be injected into the pod spec as a resource requirement so that the kube scheduler can place the pod on the right node.

phoracek commented 4 years ago

I would accept a PR for the marker to expose additional resource labels.

However, you would need to introduce your own webhook injecting the resource request into Pods (as I said, we don't own the resource injector webhook).

An alternative to the webhook would be to put the resource requirement on your Pods manually.
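For example, something along these lines in the container spec (a sketch only; for an extended resource the request and limit have to match), so the scheduler only considers nodes where the marker reported br1:

    resources:
      requests:
        ovs-cni.network.kubevirt.io/br1: 1
      limits:
        ovs-cni.network.kubevirt.io/br1: 1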

phoracek commented 4 years ago

If you insert the resource request manually, we don't need any change to the marker.

JanScheurich commented 4 years ago

For Mellanox SmartNIC network attachment definitions there is no need to refer to the OVS bridge name, as the OVS bridge is connected to the Linux bond of the two SmartNIC PFs, while all the SmartNIC VFs are configured to be part of the device pool for the SmartNIC. The OVS bridge and the device pool are "synonymous", so referring to the VF device pool guarantees injection of the VF resource into the pod spec as well as scheduling of the pod to a suitable node.

I assume the introduction of the OVS CNI marker resource was a "trick" to solve the scheduling of pods on heterogeneous K8s nodes without the need for node selector labels. Have you considered using simple (node selector) labels instead of a dummy bridge resource (with a new/enhanced webhook)?

phoracek commented 4 years ago

We have considered node selectors. However, we wanted to reuse the existing mechanism available through the resource injector, so we would not have to introduce new webhooks.

phoracek commented 4 years ago

In KubeVirt, we use the resource injector to handle scheduling for SR-IOV and Linux bridge (https://github.com/kubevirt/bridge-marker) too.

pperiyasamy commented 4 years ago

Raised a PR in the network resources injector.

moshe010 commented 4 years ago

@phoracek why did you use nfd [1] with Feature Detector Hooks? Can you explain why the ovs bridge is a countable resource? [1] - https://github.com/kubernetes-sigs/node-feature-discovery#feature-labels

JanScheurich commented 4 years ago

Modelling OVS bridges as node features looks like an interesting alternative to the trick with countable bridge resources. However, there are two aspects of the current bridge resource approach that I would like to keep:

  1. The marker monitors OVS bridges on the nodes and creates new bridge resources as needed. Could the node feature detector do something similar?
  2. The Network Resources Injector automatically adds the bridge resource requirement to the pod spec so that applications do not need to specify node selector labels to solve the scheduling. Could we enhance the Network Resources Injector to be able to also inject node selector labels?

moshe010 commented 4 years ago

So can you explain how the marker works today? Does it create/remove bridges as networks are added/deleted in the network attachments?

pperiyasamy commented 4 years ago

@moshe010 AFAIK the ovs-cni marker exposes an ovs bridge as a scalar resource with a capacity of 1k, and a net-attach-def object can refer to it via the k8s.v1.cni.cncf.io/resourceName annotation. This helps the k8s scheduler place the pod on a node where the particular ovs bridge exists, but it is never used by the device plugin or multus/ovs-cni.
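For reference, the node-level effect is roughly that the bridge shows up under the node's capacity and allocatable, something like the following sketch (values illustrative only):

status:
  capacity:
    ovs-cni.network.kubevirt.io/br1: "1000"
  allocatable:
    ovs-cni.network.kubevirt.io/br1: "1000"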

As @JanScheurich mentioned, it would be more appropriate if the ovs-cni marker exposed ovs bridges as node features and the network resources injector used node selector labels.

@phoracek WDYT ?

pperiyasamy commented 4 years ago

/cc @zshi-redhat

phoracek commented 4 years ago

@phoracek why did you use nfd [1] with Feature Detector Hooks? Can you explain why the ovs bridge is a countable resource? [1] - https://github.com/kubernetes-sigs/node-feature-discovery#feature-labels

You meant to ask why we did not use it?

We considered NFD in the beginning; however, it has a few drawbacks. It would require us to deploy the NFD components and to write a network-label-injector webhook. Writing webhooks is difficult due to certificate handling, and deploying NFD may be complicated since it is not a trivial component.

Since the resource-based way of scheduling was already employed for SR-IOV, we decided to use it for the OVS and Linux bridge CNIs as well. All we need is a simple DaemonSet exposing discovered bridges as resources on the node. The resource injector is already deployed on the cluster for us (handled through KubeVirt or SR-IOV).

We chose to follow the simpler way, even though the API is maybe not as good as it would be with NFD.

@moshe010 AFAIK the ovs-cni marker exposes an ovs bridge as a scalar resource with a capacity of 1k, and a net-attach-def object can refer to it via the k8s.v1.cni.cncf.io/resourceName annotation. This helps the k8s scheduler place the pod on a node where the particular ovs bridge exists, but it is never used by the device plugin or multus/ovs-cni.

As @JanScheurich mentioned, it would be more appropriate if the ovs-cni marker exposed ovs bridges as node features and the network resources injector used node selector labels.

@phoracek WDYT ?

I'm really hesitant to switch to NFD now. We don't have any issues with the current approach. If somebody is willing to contribute NFD support, including the injector webhook, I would be willing to discuss it. But like I said, I would be really hesitant unless we have a better reason than that it would look better.

Note that you can decide to just deploy OVS CNI without the marker and handle the scheduling in any way you like.

tl;dr: Why do we want to switch from the current approach?

JanScheurich commented 4 years ago

Maybe a stupid question: can't we switch to using node-selector labels in a cheaper way? The OVS CNI marker could just add bridge node-selector labels to the Node resource instead of creating dummy bridge resources. And the Network Resources Injector webhook could be enhanced to check for label annotations in NADs and inject them as node-selector labels into the pod spec.
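As a rough sketch of the idea (the label key below is made up; nothing defines it today), the marker would set a label on the Node and the enhanced injector would copy a matching nodeSelector into the pod spec:

# on the Node object, set by the marker
metadata:
  labels:
    ovs-cni.network.kubevirt.io/br1: "true"

# in the pod spec, injected by the enhanced webhook
spec:
  nodeSelector:
    ovs-cni.network.kubevirt.io/br1: "true"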

phoracek commented 4 years ago

There are no dummy bridge resources. We don't use a DP. We merely expose the available bridge as a node resource, not a countable one: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L115 I don't see what is so expensive here.

If the resource injector works with labels, that would be great. But I don't see a big difference between exposing the value in a resource field or a label field. But then, I don't know that much about Kubernetes internals.

phoracek commented 4 years ago

But maybe I just don't understand "The OVS CNI marker could just add bridge node-selector labels to the node resource instead of creating dummy bridge resources." properly.

JanScheurich commented 4 years ago

There are no dummy bridge resources. We don't use a DP. We merely expose the available bridge as a node resource, not a countable one: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L115 I don't see what is so expensive here.

If I'm not mistaken, each bridge is represented as a countable node resource with 1000 units: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L34

Yes, this works (as long as there are no more than 1000 ports per bridge), but it is still a trick, as there is no actual well-defined limit on the number of ports that can be connected to an OVS bridge.

If the resource injector works with labels, that would be great. But I don't see a big difference between exposing the value in a resource field or a label field. But then, I don't know that much about Kubernetes internals.

It would just be conceptually cleaner than consuming pseudo-resources. I do understand the original reasoning for re-using the existing Network Resource Injector scheme, but perhaps we can improve that by allowing injection of labels as well.

phoracek commented 4 years ago

If I'm not mistaken, each bridge is represented as a countable node resource with 1000 units: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L34

It has an available amount of 1000, but it is not a list of 1000 devices as it would be in the case of a DP, where each item represents a device. Instead, it is 1000 as in 1 kilo, as if it were an amount of available memory. Therefore I believe there is not much overhead present.

Yes, this works (as long as there are no more than 1000 ports per bridge), but it is still a trick, as there is no actual well-defined limit on the number of ports that can be connected to an OVS bridge.

The maximum limit is not an issue, is it? I doubt there would be 1000 connections to a single bridge on a node. It is simply an implicit "infinite".

It would just be conceptually cleaner than consuming pseudo-resources. I do understand the original reasoning for re-using the existing Network Resource Injector scheme, but perhaps we can improve that by allowing injection of labels as well.

It would absolutely be cleaner. I would like to hear more about the needed changes. If I have a clean Kubernetes cluster, what would I need to do to get an NFD-based solution running? How would we deliver an NFD hook, etc.? I'm not strictly against this, I just want to be sure the effort/outcome ratio is reasonable.

JanScheurich commented 4 years ago

@phoracek: I didn't mean to imply that there was a lot of overhead with modelling the bridge as a resource.

My thought was whether we could do the label approach without the dependency on NFD, still relying on a (slightly enhanced) Network Resources Injector.

phoracek commented 4 years ago

(Sorry if I look annoyed and push back too hard, I don't mean to)

@JanScheurich could you summarize the benefits of such a change? Anything apart from getting rid of the countable API which does not make sense for "uncountable" resources?

BTW, even though we don't implement QoS now, the countable approach may be helpful once we decide to. Let users request 1GB of throughput, for example.

kubevirt-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot commented 4 years ago

@kubevirt-bot: Closing this issue.

In response to [this](https://github.com/kubevirt/ovs-cni/issues/107#issuecomment-667247024):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.