ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0

Latency Issues on okd 4.5 #1638

Closed mikonse closed 1 week ago

mikonse commented 4 years ago

Hi there, while migrating our okd 4.5 cluster from the openshift sdn to ovn-kubernetes we are running into a couple of problems. One of them is that we are seeing varying network latencies in the cluster. More specifically, we first noticed it with our installed operators: querying the kubernetes/openshift API sometimes has such a large latency that some of the operators crash due to timeouts on their API connections.

I wrote a small poller container that checks the latency to the kube api every few seconds and got the following results: most of the time the latency is normal at around 5-10ms, but every couple of minutes or so it jumps to 3-5s. Occasionally it even rises above 10s, which is where some of the operators run into timeouts on their API connections. Here is a small log; a sketch of the poller follows it:

2020-08-26 06:35:38,884 | Querying kube api took 1.0134170055389404 seconds, mean: 1.0291468663649126, std: 1.2074876503091008
2020-08-26 06:35:43,894 | Querying kube api took 0.006404399871826172 seconds, mean: 0.9439183274904887, std: 1.1885476073072725
2020-08-26 06:35:48,904 | Querying kube api took 0.006301164627075195 seconds, mean: 0.8717939303471491, std: 1.167283185958872
2020-08-26 06:35:56,964 | Querying kube api took 3.0570592880249023 seconds, mean: 1.0278843130384172, std: 1.2644514229247072
2020-08-26 06:36:01,973 | Querying kube api took 0.007592916488647461 seconds, mean: 0.9598648866017659, std: 1.2466091468589626
2020-08-26 06:36:06,983 | Querying kube api took 0.006791353225708008 seconds, mean: 0.9002977907657623, std: 1.2276823272050628
2020-08-26 06:36:11,993 | Querying kube api took 0.006376981735229492 seconds, mean: 0.8477142137639663, std: 1.2083084071103605
2020-08-26 06:36:17,000 | Querying kube api took 0.006534099578857422 seconds, mean: 0.8009819851981269, std: 1.1888803697358818
2020-08-26 06:36:23,030 | Querying kube api took 1.0255475044250488 seconds, mean: 0.8128012230521754, std: 1.1565320898986633
2020-08-26 06:36:31,094 | Querying kube api took 3.059217691421509 seconds, mean: 0.9251220464706421, std: 1.232674972127253
2020-08-26 06:36:37,121 | Querying kube api took 1.0260138511657715 seconds, mean: 0.9299264181227911, std: 1.2016646492636327
2020-08-26 06:36:43,137 | Querying kube api took 1.0149710178375244 seconds, mean: 0.9337920817461881, std: 1.1728447795632255
2020-08-26 06:36:53,220 | Querying kube api took 5.078668832778931 seconds, mean: 1.1140041143997856, std: 1.4352685689884035
2020-08-26 06:36:59,265 | Querying kube api took 1.0434315204620361 seconds, mean: 1.1110635896523793, std: 1.4037942683922056
2020-08-26 06:37:05,285 | Querying kube api took 1.0185487270355225 seconds, mean: 1.107362995147705, std: 1.3743619526156705
2020-08-26 06:37:11,301 | Querying kube api took 1.011831283569336 seconds, mean: 1.103688698548537, std: 1.3467245292072987
2020-08-26 06:37:19,360 | Querying kube api took 3.054666042327881 seconds, mean: 1.1759471186885126, std: 1.3729111685444382
2020-08-26 06:37:25,376 | Querying kube api took 1.0113580226898193 seconds, mean: 1.1700689366885595, std: 1.3476060266185836
2020-08-26 06:37:40,385 | Got error HTTPSConnectionPool(host='100.123.0.1', port=443): Max retries exceeded with url: /api/v1 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5bc2073f60>, 'Connection to 100.123.0.1 timed out. (connect timeout=10)')) when querying kube api
2020-08-26 06:37:46,422 | Querying kube api took 1.032435655593872 seconds, mean: 1.1653229614783978, std: 1.3235696295077828
2020-08-26 06:37:56,518 | Querying kube api took 5.090970516204834 seconds, mean: 1.296177879969279, std: 1.4849642075041256
2020-08-26 06:38:01,530 | Querying kube api took 0.007029294967651367 seconds, mean: 1.25459244174342, std: 1.4782505030705329
2020-08-26 06:38:09,601 | Querying kube api took 3.0661325454711914 seconds, mean: 1.3112030699849129, std: 1.4890553578273649
2020-08-26 06:38:15,621 | Querying kube api took 1.0162427425384521 seconds, mean: 1.302264878244111, std: 1.4665033540887158
2020-08-26 06:38:22,646 | Querying kube api took 2.02026629447937 seconds, mean: 1.3233825669569128, std: 1.4493529413628203
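
The test script itself was never attached to the thread; for reference, a minimal sketch of a poller that would produce output in this format might look like the following. It assumes it runs in-cluster with the default service account mounted and the requests library available; the service IP 100.123.0.1 and the 10s connect timeout are taken from the log above.

import logging
import statistics
import time

import requests  # assumption: the poller image ships with requests

# Values taken from the log output above / the default in-cluster paths.
API_URL = "https://100.123.0.1:443/api/v1"
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

logging.basicConfig(format="%(asctime)s | %(message)s", level=logging.INFO)

with open(TOKEN_PATH) as f:
    token = f.read().strip()

samples = []
while True:
    start = time.time()
    try:
        requests.get(API_URL, headers={"Authorization": "Bearer " + token},
                     verify=CA_PATH, timeout=10)
        samples.append(time.time() - start)
        logging.info("Querying kube api took %s seconds, mean: %s, std: %s",
                     samples[-1], statistics.mean(samples), statistics.pstdev(samples))
    except requests.exceptions.RequestException as err:
        logging.info("Got error %s when querying kube api", err)
    time.sleep(5)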

When deploying our cluster with exactly the same settings, only swapping ovn-kubernetes for the openshift-sdn network provider, everything works fine and the latency stays constant at the 5-10ms mark.

So far the only irregularity that I could find is a very frequent occurrence of

I0826 06:37:11.406851       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0826 06:37:18.574245       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

in the openshift apiserver logs, which to me indicates that something outside both the apiserver and the API client is forcibly closing the connections. As far as I can tell, the ovn and ovs logs look normal. I am currently also testing the latency with custom deployments and will update this issue as soon as I have some results.

Any ideas/pointers on how to debug this, or has anyone experienced similar issues?

Edit: Small update: when trying to reproduce the issue with self-deployed pods and measuring the latency between them, it was constant at the normal 5-10ms. I tested this with a simple nginx pod, queried both from a pod on the same node and from one on a different node. It seems the network problems only arise when querying the openshift API through its hard-coded service IP.
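
If the problem really is specific to the service IP, it may help to time the raw TCP handshake separately from the full HTTP round trip, for example with a small sketch like the one below. The pod IP 10.128.2.15 is only a placeholder for the nginx test pod; substitute the real addresses.

import socket
import time

# Placeholder targets: the kubernetes service IP from the logs above and a
# hypothetical pod IP for the nginx test pod -- substitute real values.
TARGETS = [("100.123.0.1", 443), ("10.128.2.15", 80)]

for host, port in TARGETS:
    start = time.time()
    try:
        # Only the TCP handshake is timed here, so a multi-second result
        # points at connection setup rather than a slow apiserver response.
        with socket.create_connection((host, port), timeout=10):
            print("connect to %s:%d took %.4fs" % (host, port, time.time() - start))
    except OSError as err:
        print("connect to %s:%d failed after %.4fs: %s"
              % (host, port, time.time() - start, err))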

jomeier commented 4 years ago

@mikonse Could you attach your test script? It might be helpful to reproduce the problem.

sst1xx commented 2 years ago

Hi, it seems we have the same problem after migrating from openshift sdn to the ovn-kubernetes CNI plugin on OKD 4.7.0-0.okd-2021-08-22-163618. If we remove all network policies from the namespace, everything goes smoothly. I have attached the 3 network policies with which we hit the problem.

@mikonse could you please share your script with @jomeier? Maybe it will help to find the root cause.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: performance-test-ovn
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
  podSelector: {}
  policyTypes:
  - Ingress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-monitoring
  namespace: performance-test-ovn
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: monitoring
  podSelector: {}
  policyTypes:
  - Ingress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: performance-test-ovn
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
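
Since removing every policy restores normal latency, bisecting the three policies (remove one, re-run the latency test, re-apply it) might narrow down which one triggers the slowdown. A small sketch of that loop using the official kubernetes Python client (an assumption; kubectl works just as well):

from kubernetes import client, config  # assumption: kubernetes Python client installed

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.NetworkingV1Api()
namespace = "performance-test-ovn"

# List the policies currently applied to the test namespace.
for policy in api.list_namespaced_network_policy(namespace).items:
    print(policy.metadata.name)

# Delete a single policy, re-run the latency poller, then re-apply it from
# the YAML above before moving on to the next one, e.g.:
# api.delete_namespaced_network_policy(name="allow-same-namespace", namespace=namespace)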

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been stalled for 5 days with no activity.