submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Gateway migration in one cluster is causing failures in remote cluster #1544

Closed sridhargaddam closed 2 years ago

sridhargaddam commented 3 years ago

What happened: In a two-cluster setup (Perf1 and Perf2), Submariner was installed with NAT disabled to connect the clusters, and the connections were established successfully. In this setup, when the active gateway migrates on the Perf1 cluster, the routes in table 150 on the active gateway node of Perf2 are inadvertently deleted. As a result, the pinger (the health checker) running on the Perf2 gateway can no longer reach the Perf1 health-check IP. In addition, host networking from Perf2 to the remote cluster is broken.

What you expected to happen: When the active Gateway migrates, the HostNetworking use-cases should continue to work.

A temporary work-around is to restart the route-agent running on the active Gateway node of the Perf2 cluster.

Environment:
Submariner version: v0.9
Kubernetes Server version: v1.21.1+9807387
CNI: OpenShiftSDN on VMware clusters

Logs from route-agent running on the Active Gateway node of Perf2 cluster are attached: api-perf2-chris-ocs-ninja_6443_submariner-routeagent-xzx8f.log

sridhargaddam commented 3 years ago

On analysing the logs, the issue appears to be a race condition in the handling of endpoint events.

On the Perf2 route-agent pod on the gateway node: the route-agent receives the endpoint-created notification for the new Perf1 endpoint and then receives the endpoint-removed notification for the old endpoint. While handling the removal, it deletes the entries in routing table 150, which causes this issue.

I0901 00:34:03.118840       1 vxlan.go:186] Successfully added the bridge fdb entry 10.70.56.183 00:00:00:00:00:00
I0901 00:34:03.118922       1 vxlan.go:271] Successfully configured reverse path filter to loose mode on "vx-submariner"
I0901 00:34:06.583558       1 handler.go:69] A new Endpoint for remote cluster "ocp4perf1" has been created: v1.EndpointSpec{ClusterID:"ocp4perf1", CableName:"submariner-cable-ocp4perf1-10-70-56-242", HealthCheckIP:"10.5.10.1", Hostname:"perf1-4gqrk-worker-bw729", Subnets:[]string{"10.15.0.0/16", "10.5.0.0/16"}, PrivateIP:"10.70.56.242", PublicIP:"125.16.100.118", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"preferred-server":"false", "udp-port":"4500"}}
I0901 00:34:06.583604       1 routes_iface.go:249] On GWNode, in updateRoutingRulesForInterClusterSupport ignoring
I0901 00:34:06.586191       1 iptables_iface.go:105] Installing iptables rule for outgoing traffic: -s 10.6.0.0/16 -d 10.15.0.0/16 -j ACCEPT
I0901 00:34:06.589602       1 iptables_iface.go:113] Installing iptables rule for incoming traffic: -s 10.15.0.0/16 -d 10.6.0.0/16 -j ACCEPT
I0901 00:34:06.607762       1 iptables_iface.go:105] Installing iptables rule for outgoing traffic: -s 10.6.0.0/16 -d 10.5.0.0/16 -j ACCEPT
I0901 00:34:06.651785       1 iptables_iface.go:113] Installing iptables rule for incoming traffic: -s 10.5.0.0/16 -d 10.6.0.0/16 -j ACCEPT
I0901 00:34:06.657487       1 ipset.go:351] Running ipset [add SUBMARINER-REMOTECIDRS 10.15.0.0/16 -exist]
I0901 00:34:06.661841       1 ipset.go:351] Running ipset [add SUBMARINER-REMOTECIDRS 10.5.0.0/16 -exist]
I0901 00:34:06.663393       1 handler.go:81] A new Endpoint for remote cluster "ocp4perf1" has been removed: v1.EndpointSpec{ClusterID:"ocp4perf1", CableName:"submariner-cable-ocp4perf1-10-70-56-199", HealthCheckIP:"10.5.8.1", Hostname:"perf1-4gqrk-worker-dfd5n", Subnets:[]string{"10.15.0.0/16", "10.5.0.0/16"}, PrivateIP:"10.70.56.199", PublicIP:"125.16.100.118", NATEnabled:false, Backend:"libreswan", BackendConfig:map[string]string{"preferred-server":"false", "udp-port":"4500"}}
I0901 00:34:06.663432       1 routes_iface.go:249] On GWNode, in updateRoutingRulesForInterClusterSupport ignoring
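
The ordering in the logs above (created event for the new endpoint, then removed event for the old one, both advertising the same subnets) can be made harmless by reference-counting the remote subnets. The following is only a minimal sketch of that idea, not Submariner's actual code or the fix that was merged: `Endpoint`, `RouteTracker`, `EndpointCreated`, and `EndpointRemoved` are hypothetical names, and the real route programming for table 150 is left out entirely.

```go
package main

import "sync"

// Endpoint is a hypothetical, trimmed-down stand-in for the
// v1.EndpointSpec printed in the logs; only the fields needed here.
type Endpoint struct {
	CableName string
	Subnets   []string
}

// RouteTracker (hypothetical) counts how many known endpoints advertise
// each remote subnet, so a route is withdrawn only when the last
// endpoint using that subnet goes away.
type RouteTracker struct {
	mu     sync.Mutex
	refs   map[string]int      // subnet -> number of endpoints advertising it
	cables map[string][]string // cable name -> subnets recorded at creation
}

func NewRouteTracker() *RouteTracker {
	return &RouteTracker{
		refs:   make(map[string]int),
		cables: make(map[string][]string),
	}
}

// EndpointCreated records the endpoint and returns the subnets whose
// routes still need to be installed (count went from 0 to 1).
func (t *RouteTracker) EndpointCreated(ep Endpoint) []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	if _, seen := t.cables[ep.CableName]; seen {
		return nil // duplicate notification, nothing to do
	}
	t.cables[ep.CableName] = append([]string(nil), ep.Subnets...)

	var toInstall []string
	for _, subnet := range ep.Subnets {
		t.refs[subnet]++
		if t.refs[subnet] == 1 {
			toInstall = append(toInstall, subnet)
		}
	}
	return toInstall
}

// EndpointRemoved forgets the endpoint and returns only the subnets
// that no remaining endpoint advertises, i.e. the routes that are
// actually safe to delete.
func (t *RouteTracker) EndpointRemoved(ep Endpoint) []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	subnets, seen := t.cables[ep.CableName]
	if !seen {
		return nil // removal for an endpoint we never recorded
	}
	delete(t.cables, ep.CableName)

	var toDelete []string
	for _, subnet := range subnets {
		t.refs[subnet]--
		if t.refs[subnet] <= 0 {
			delete(t.refs, subnet)
			toDelete = append(toDelete, subnet)
		}
	}
	return toDelete
}
```

With the two cables from the logs, creating `submariner-cable-ocp4perf1-10-70-56-242` while `submariner-cable-ocp4perf1-10-70-56-199` is still recorded installs nothing new, and the later removal of the old cable returns an empty list, so the routes for 10.15.0.0/16 and 10.5.0.0/16 would stay in place.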

nyechiel commented 3 years ago

@aswinsuryan are you looking into this?

aswinsuryan commented 3 years ago

@nyechiel I am currently working on the submariner-addon and cloud-prepare integration. Once that is complete, I can take a look at this one.

sridhargaddam commented 3 years ago

This issue is mostly reproduced in OCP setups, and I've seen multiple users report this problem with the Submariner 0.9 release. I think it still applies to the latest Submariner release. Because it affects the datapath even though the clusters are shown as connected, I am marking the priority as high.

nyechiel commented 2 years ago

@aswinsuryan can you provide a status update on this one?