submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

OCP 4.11+OVN: connections are slow to come up #1909

Closed · nyechiel closed this issue 2 years ago

nyechiel commented 2 years ago

What happened:

subctl deployment of two OCP 4.11 clusters on AWS using the OVNKubernetes network plugin. The connections (libreswan) do come up eventually, but they took more than 5 minutes. The subctl diagnose output might give us a clue (the same warning is seen on both clusters):


 ✓ Non-Globalnet deployment detected - checking if cluster CIDRs overlap 
 ✓ Clusters do not have overlapping CIDRs
 ⚠ Checking Submariner pods 
 ⚠ Pod "submariner-gateway-jqrtn" has restarted 5 times
 ✓ All Submariner pods are up and running
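
For reference, connection status was being watched with subctl while the tunnels came up; the kubeconfig path below is a placeholder, not the exact one used:

export KUBECONFIG=cluster-a/auth/kubeconfig   # placeholder path
subctl show connections                       # gateway connection status per remote cluster
subctl diagnose all                           # full diagnostics, including the pod check above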

What you expected to happen:

Connections should come up faster, with no errors in the subctl diagnose output.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy two OCP 4.11 clusters on AWS. Modify the install-configs to ensure non-overlapping CIDRs and the OVNKubernetes network plugin.
  2. Run the cloud prepare commands.
  3. Run the join commands (I used the default flags); a sketch of the invocations is shown below.
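
For reference, the cloud prepare and join invocations were roughly the following; the metadata path, broker-info file, and cluster ID are placeholders rather than the exact values used:

# open the ports Submariner needs on each AWS cluster
subctl cloud prepare aws --ocp-metadata path/to/cluster-a/metadata.json
# join each cluster to the broker, keeping the default flags
subctl join broker-info.subm --clusterid cluster-a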

Anything else we need to know?:

gather.zip

sridhargaddam commented 2 years ago

The Gateway pod was restarted around 5 times, and the following is seen in its logs just before the last restart.

It appears that while the Gateway pod is trying to set up IPsec tunnels to the remote cluster, there is a brief interval when connections to the K8s API server are broken. It recovers once everything has settled.

I0705 06:28:05.900026       1 main.go:93] Starting the submariner gateway engine
W0705 06:28:05.900228       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0705 06:28:05.903034       1 main.go:115] Creating the cable engine
F0705 06:28:35.904043       1 local_endpoint.go:48] Error getting information on the local node: unable to find local node "ip-10-0-2-152.us-east-2.compute.internal": Get "https://172.30.0.1:443/api/v1/nodes/ip-10-0-2-152.us-east-2.compute.internal": dial tcp 172.30.0.1:443: i/o timeout

The route-agent pod running on the same node was also restarted 3 times, and its logs show similar errors.

I0705 06:28:03.966980       1 registry.go:65] Event handler "MTU handler" added to registry "routeagent_driver".
I0705 06:28:03.967189       1 cni_iface.go:72] Interface "lo" has "127.0.0.1" address
I0705 06:28:03.967276       1 cni_iface.go:72] Interface "br-ex" has "10.0.2.152" address
I0705 06:28:03.967350       1 cni_iface.go:72] Interface "ovn-k8s-mp0" has "10.130.2.2" address
I0705 06:28:03.967360       1 cni_iface.go:77] Found CNI Interface "ovn-k8s-mp0" that has IP "10.130.2.2" from ClusterCIDR "10.128.0.0/14"
E0705 06:28:33.968480       1 main.go:115] Error while annotating the node: error annotating node with CNI interface IP: error updatating node "ip-10-0-2-152.us-east-2.compute.internal": unable to get node info for node "ip-10-0-2-152.us-east-2.compute.internal": Get "https://172.30.0.1:443/api/v1/nodes/ip-10-0-2-152.us-east-2.compute.internal": dial tcp 172.30.0.1:443: i/o timeout

On the second OVN cluster it's the same error. Gateway pod logs (previous instance):

F0705 06:29:14.005278       1 local_endpoint.go:48] Error getting information on the local node: unable to find local node "ip-10-0-62-17.us-east-2.compute.internal": Get "https://172.31.0.1:443/api/v1/nodes/ip-10-0-62-17.us-east-2.compute.internal": dial tcp 172.31.0.1:443: i/o timeout

Route-agent pod running on the same node:

I0705 06:27:17.460478       1 registry.go:65] Event handler "MTU handler" added to registry "routeagent_driver".
I0705 06:27:17.460681       1 cni_iface.go:72] Interface "lo" has "127.0.0.1" address
I0705 06:27:17.460766       1 cni_iface.go:72] Interface "br-ex" has "10.0.62.17" address
I0705 06:27:17.460842       1 cni_iface.go:72] Interface "ovn-k8s-mp0" has "10.134.2.2" address
I0705 06:27:17.460878       1 cni_iface.go:77] Found CNI Interface "ovn-k8s-mp0" that has IP "10.134.2.2" from ClusterCIDR "10.132.0.0/14"
E0705 06:27:47.461130       1 main.go:115] Error while annotating the node: error annotating node with CNI interface IP: error updatating node "ip-10-0-62-17.us-east-2.compute.internal": unable to get node info for node "ip-10-0-62-17.us-east-2.compute.internal": Get "https://172.31.0.1:443/api/v1/nodes/ip-10-0-62-17.us-east-2.compute.internal": dial tcp 172.31.0.1:443: i/o timeout
W0705 06:27:47.461245       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
F0705 06:28:17.462169       1 main.go:124] Error creating controller for event handling error creating resource watcher: error building the REST mapper: error retrieving API group resources: Get "https://172.31.0.1:443/api?timeout=32s": dial tcp 172.31.0.1:443: i/o timeout
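
One way to confirm whether OVN-K or the API server was still settling around these timestamps is to check the CNI pods and the relevant cluster operators; the namespace and operator names below are the standard OCP ones, listed as a suggestion rather than output from this setup:

oc -n openshift-ovn-kubernetes get pods -o wide            # ovnkube pod health per node
oc get clusteroperators network kube-apiserver             # should be Available=True and not Degraded
oc get events -n submariner-operator --sort-by=.lastTimestamp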
nyechiel commented 2 years ago

@aswinsuryan @astoycos have you noticed this before?

astoycos commented 2 years ago

I would need to see the full must-gather / OVN-K / API server logs, but at first glance it seems like OVN-K is down for a period of time here, causing other things to fail, since the API server traffic should be going through the CNI in this case, I believe.

Maybe a dumb question :) but were both clusters fully up (and healthy) before running join?
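
For what it's worth, one way to check that before running join is to wait for the cluster operators and nodes to settle; a rough sketch, with arbitrary timeouts:

oc wait clusteroperators --all --for=condition=Available=True --timeout=30m
oc wait clusteroperators --all --for=condition=Degraded=False --timeout=30m
oc wait nodes --all --for=condition=Ready --timeout=15m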

nyechiel commented 2 years ago

It looks like OVN-K might be a little slower to finish coming up compared to OpenShiftSDN. I will try to reproduce and capture full OCP must-gather logs to confirm.
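
For completeness, these are the generic invocations for capturing those logs; the actual must-gather image can differ per OCP version:

oc adm must-gather    # full OCP must-gather from each cluster
subctl gather         # Submariner-specific resources and pod logs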

nyechiel commented 2 years ago

I reran some tests and indeed it looks like the clusters were not fully up during the initial join process.