The Gateway pod was restarted around 5 times, and the following is seen in the pod's logs just before it was restarted for the last time.
It appears that while the Gateway pod is trying to set up IPsec tunnels to the remote cluster, there is a brief interval during which connections to the K8s API server are broken. It seems to recover once everything has settled.
I0705 06:28:05.900026 1 main.go:93] Starting the submariner gateway engine
W0705 06:28:05.900228 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0705 06:28:05.903034 1 main.go:115] Creating the cable engine
F0705 06:28:35.904043 1 local_endpoint.go:48] Error getting information on the local node: unable to find local node "ip-10-0-2-152.us-east-2.compute.internal": Get "https://172.30.0.1:443/api/v1/nodes/ip-10-0-2-152.us-east-2.compute.internal": dial tcp 172.30.0.1:443: i/o timeout
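For anyone trying to reproduce this, the restart count and the pre-restart ("previous") logs above can be pulled with standard kubectl commands. This assumes Submariner is installed in the default submariner-operator namespace; the gateway pod name is a placeholder:
# Show restart counts and node placement for the Submariner pods
kubectl -n submariner-operator get pods -o wide
# --previous shows the logs of the container instance that ran before the last restart
kubectl -n submariner-operator logs --previous <gateway-pod-name>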
The route-agent pod running on the same node was also restarted 3 times, and its logs show similar errors.
I0705 06:28:03.966980 1 registry.go:65] Event handler "MTU handler" added to registry "routeagent_driver".
I0705 06:28:03.967189 1 cni_iface.go:72] Interface "lo" has "127.0.0.1" address
I0705 06:28:03.967276 1 cni_iface.go:72] Interface "br-ex" has "10.0.2.152" address
I0705 06:28:03.967350 1 cni_iface.go:72] Interface "ovn-k8s-mp0" has "10.130.2.2" address
I0705 06:28:03.967360 1 cni_iface.go:77] Found CNI Interface "ovn-k8s-mp0" that has IP "10.130.2.2" from ClusterCIDR "10.128.0.0/14"
E0705 06:28:33.968480 1 main.go:115] Error while annotating the node: error annotating node with CNI interface IP: error updatating node "ip-10-0-2-152.us-east-2.compute.internal": unable to get node info for node "ip-10-0-2-152.us-east-2.compute.internal": Get "https://172.30.0.1:443/api/v1/nodes/ip-10-0-2-152.us-east-2.compute.internal": dial tcp 172.30.0.1:443: i/o timeout
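For context, 172.30.0.1:443 in the timeouts above should be the in-cluster kubernetes Service ClusterIP (172.30.0.0/16 is the default OpenShift service network), which pods reach through the CNI data path; that would explain why an unhealthy OVN-K shows up as i/o timeouts in the Submariner pods. This can be cross-checked with:
# The "kubernetes" Service in the default namespace should list 172.30.0.1 as its ClusterIP
kubectl get svc kubernetes -n default -o wide
# Its endpoints should point at the API server addresses
kubectl get endpoints kubernetes -n default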
The same error appears on the second OVN cluster as well. Gateway pod logs (previous container logs):
F0705 06:29:14.005278 1 local_endpoint.go:48] Error getting information on the local node: unable to find local node "ip-10-0-62-17.us-east-2.compute.internal": Get "https://172.31.0.1:443/api/v1/nodes/ip-10-0-62-17.us-east-2.compute.internal": dial tcp 172.31.0.1:443: i/o timeout
Route-agent pod running on the same node:
I0705 06:27:17.460478 1 registry.go:65] Event handler "MTU handler" added to registry "routeagent_driver".
I0705 06:27:17.460681 1 cni_iface.go:72] Interface "lo" has "127.0.0.1" address
I0705 06:27:17.460766 1 cni_iface.go:72] Interface "br-ex" has "10.0.62.17" address
I0705 06:27:17.460842 1 cni_iface.go:72] Interface "ovn-k8s-mp0" has "10.134.2.2" address
I0705 06:27:17.460878 1 cni_iface.go:77] Found CNI Interface "ovn-k8s-mp0" that has IP "10.134.2.2" from ClusterCIDR "10.132.0.0/14"
E0705 06:27:47.461130 1 main.go:115] Error while annotating the node: error annotating node with CNI interface IP: error updatating node "ip-10-0-62-17.us-east-2.compute.internal": unable to get node info for node "ip-10-0-62-17.us-east-2.compute.internal": Get "https://172.31.0.1:443/api/v1/nodes/ip-10-0-62-17.us-east-2.compute.internal": dial tcp 172.31.0.1:443: i/o timeout
W0705 06:27:47.461245 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0705 06:28:17.462169 1 main.go:124] Error creating controller for event handling error creating resource watcher: error building the REST mapper: error retrieving API group resources: Get "https://172.31.0.1:443/api?timeout=32s": dial tcp 172.31.0.1:443: i/o timeout
@aswinsuryan @astoycos have you noticed this before?
I would need to see the full must-gather / OVN-K / API server logs. At first glance it seems like OVN-K is down for a period of time here, causing other things to fail, since the API server traffic should be going through the CNI in this case, I believe.
Maybe a dumb question :) but were both clusters up successfully (and healthy) before running join?
It looks like OVN-K might be a little slower to finish coming up compared to OpenShiftSDN. I will try to reproduce and capture full OCP must-gather logs to confirm.
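To confirm that, something along these lines should show whether OVN-K itself was restarting or not yet Ready during that window, and gather the logs mentioned above (openshift-ovn-kubernetes is the standard OVN-K namespace on OCP; run against each cluster):
# Check OVN-Kubernetes pod status and restart counts
oc -n openshift-ovn-kubernetes get pods -o wide
# Collect cluster-wide diagnostics and Submariner-specific logs
oc adm must-gather
subctl gather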
I reran some tests, and indeed it looks like the clusters were not fully up during the initial join process.
What happened:
subctl deployment of two OCP 4.11 clusters on AWS using the OVNKubernetes network plugin. Connections (libreswan) are coming up eventually, but it took more than 5 minutes. The subctl diagnose output might give us a clue (the same error is seen on both clusters):
What you expected to happen:
Connections to come up faster. No errors in the subctl diagnose output.
How to reproduce it (as minimally and precisely as possible):
- Deploy two OCP 4.11 clusters on AWS with the OVNKubernetes network plugin
- Run the cloud-prepare commands
- Run the join commands (I used the default flags); a sketch of both is included below
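For reference, a sketch of what the cloud-prepare and join steps typically look like; the metadata paths, cluster IDs, and the broker-info file below are placeholders and will differ per environment:
# Open the ports Submariner needs on each AWS cluster (placeholder metadata paths)
subctl cloud prepare aws --ocp-metadata /path/to/cluster-a/metadata.json
subctl cloud prepare aws --ocp-metadata /path/to/cluster-b/metadata.json
# Join each cluster to the broker, switching kubeconfig/context per cluster
subctl join broker-info.subm --clusterid cluster-a
subctl join broker-info.subm --clusterid cluster-b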
Anything else we need to know?:
gather.zip