kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] Performance about Service & NetworkPolicy #4605

Open zsxsoft opened 2 hours ago

zsxsoft commented 2 hours ago

Kube-OVN Version

v1.12.26

Kubernetes Version

v1.27.4

Operation-system/Kernel Version

TencentOS Server 4.0 6.6.6-2401.0.1.tl4.4.x86_64

Description

I have a cluster with ~300 Pods and ~100 NetworkPolicies. Every time I create a Service, a large number of UpdateNp log entries appear in kube-ovn-controller.log, and at the same time the Dashboard shows Work Queue Latency reaching about 1 minute.

Then I checked the code: https://github.com/kubeovn/kube-ovn/blob/v1.12.26/pkg/controller/network_policy.go#L855-L878

The code above seems to indicate that whenever a Service is created, all Pods in the corresponding Namespace are retrieved and matched against all NetworkPolicies, and every match is added to the UpdateNp queue. This not only has O(n^2) time complexity but, in my cluster, is also equivalent to updating every NetworkPolicy.
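To make the cost concrete, here is a minimal Go sketch of the fan-out as I understand it. It is a simplified model and not the kube-ovn code: the `Pod` and `NetworkPolicy` types and the `matches` and `policiesToUpdate` functions are hypothetical stand-ins, and selectors are reduced to plain label maps.

```go
package main

import "fmt"

// Pod and NetworkPolicy are simplified stand-ins for the Kubernetes types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type NetworkPolicy struct {
	Name     string
	Selector map[string]string // an empty selector matches every Pod
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// policiesToUpdate models the work triggered by one Service event: every Pod
// is compared against every policy. It returns the deduplicated policy names
// in first-seen order and the number of comparisons performed.
func policiesToUpdate(pods []Pod, nps []NetworkPolicy) ([]string, int) {
	seen := make(map[string]bool)
	var queue []string
	comparisons := 0
	for _, pod := range pods {
		for _, np := range nps {
			comparisons++
			if matches(np.Selector, pod.Labels) && !seen[np.Name] {
				seen[np.Name] = true
				queue = append(queue, np.Name)
			}
		}
	}
	return queue, comparisons
}

func main() {
	// Roughly the scale of the cluster in this report.
	pods := make([]Pod, 300)
	for i := range pods {
		pods[i] = Pod{
			Name:   fmt.Sprintf("pod-%d", i),
			Labels: map[string]string{"app": fmt.Sprintf("app-%d", i%100)},
		}
	}
	nps := make([]NetworkPolicy, 100)
	for i := range nps {
		nps[i] = NetworkPolicy{
			Name:     fmt.Sprintf("np-%d", i),
			Selector: map[string]string{"app": fmt.Sprintf("app-%d", i)},
		}
	}
	queue, comparisons := policiesToUpdate(pods, nps)
	fmt.Println(len(queue), comparisons) // prints "100 30000"
}
```

In this model a single Service event costs 30,000 selector comparisons and enqueues all 100 policies, even though the Service itself changed none of them.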

https://github.com/kubeovn/kube-ovn/pull/1167

Since Pods already match and respond to UpdateNp requests when they are created, and no additional operations are performed on NetworkPolicies during the creation of a Service, I don't quite understand the purpose here. Moreover, in large-scale clusters, creating/deleting Services should be a relatively common operation. In this case, I'm afraid that all requests for creating/deleting Pods will be blocked after the creation of a Service.

Is there any solution?

Steps To Reproduce

  1. Create a large number of NetworkPolicies and Pods.
  2. Create/Delete a Service.

Current Behavior

/

Expected Behavior

/

dosubot[bot] commented 2 hours ago

Hey @zsxsoft, I'm here to help you with any bugs, questions, or becoming a contributor! 😊

To address the performance issue related to Service creation and NetworkPolicy updates in Kube-OVN, you can consider the following optimizations:

  1. Increase Probe Interval: Adjusting the default probe interval for large clusters can help reduce the load and improve performance. This change has been implemented in a recent update [1].

  2. Optimize Protocol Checks: The protocol check in NetworkPolicy updates has been streamlined to reduce latency, which is part of the recent optimizations [1].

  3. Review Pull Request #1167: This pull request specifically addresses the O(n^2) complexity issue by optimizing how Services interact with NetworkPolicies. It suggests that since Pods already handle UpdateNp requests upon creation, the additional processing during Service creation might be unnecessary. Reviewing the changes in this pull request could provide insights into further optimizations.

  4. Consider Caching Mechanisms: Implementing caching mechanisms could help reduce the need to re-evaluate all NetworkPolicies when a Service is created, thus improving performance.

These steps aim to mitigate the high latency and potential blocking of operations in large clusters.
