submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Submariner "broke" a spoke OVNKubernetes ocp cluster #1625

Closed josecastillolema closed 2 years ago

josecastillolema commented 2 years ago

What happened: We have two clusters:

Both clusters have overlapping CIDRs. When we joined test595 to the broker, we were not aware of the overlapping CIDR limitation with OVNKubernetes. The join went ahead with no warnings about the overlap, Submariner did not work (as expected), and after uninstalling Submariner following the instructions in https://submariner.io/operations/cleanup/ the cluster seems broken. We see these messages in some pods, referring to the okdhub cluster: E1216 09:50:04.865163 1 runtime.go:78] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"default/demo-okdhub\" on index \"ServiceName\": endpointSlice missing kubernetes.io/service-name label"} (unable to calculate an index entry for key "default/demo-okdhub" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label)

What you expected to happen: The join should have failed, preventing the installation as there are overlapping CIDRs. Also, the cleanup should have worked, and not left the cluster in an inconsistent state.

How to reproduce it (as minimally and precisely as possible):
1. Create an OVNKubernetes OCP broker and spoke cluster
2. Create a second OVNKubernetes OCP cluster with overlapping IPs
3. Join the second cluster to the broker

Environment:

sridhargaddam commented 2 years ago

Thanks @josecastillolema for reporting the issue.

sridhargaddam commented 2 years ago

A related fix was made earlier in the following PR, which was also backported: https://github.com/submariner-io/submariner/pull/1432

We need to investigate whether an additional check is missing for the OVN-Kubernetes implementation.

nyechiel commented 2 years ago

@josecastillolema can you please confirm what version of subctl/Submariner you are using?

josecastillolema commented 2 years ago

subctl version: v0.11.0

nyechiel commented 2 years ago

@josecastillolema how important is this for you? We have other priorities at the moment, so unless this is blocking your work, I would suggest that we follow up on this as part of our next version (Submariner 0.13).

josecastillolema commented 2 years ago

No problem @nyechiel, I just wanted to write this down so we can track it. Thanks.

astoycos commented 2 years ago

/cc @astoycos

maayanf24 commented 2 years ago

We need to retest to see whether this bug is still valid.

sridhargaddam commented 2 years ago

I verified this issue with the latest Submariner devel image on an OCP 4.11 cluster. My observations follow.

[sgaddam@localhost aws-ocp]$ oc version
Client Version: 4.9.11
Server Version: 4.11.0-0.nightly-2022-05-18-171831
Kubernetes Version: v1.23.3+69213f8

Output from subctl diagnose command

[sgaddam@localhost aws-ocp]$ subctl diagnose all
Cluster "sgaddam-aws-spoke1"
 ✓ Checking Submariner support for the Kubernetes version 
 ✓ Kubernetes version "v1.23.3+69213f8" is supported

 ✓ Checking Submariner support for the CNI network plugin
 ✓ The detected CNI network plugin ("OVNKubernetes") is supported
 ✓ Trying to detect the Calico ConfigMap 

 ✓ Checking gateway connections 
 ✓ All connections are established

 ✗ Non-Globalnet deployment detected - checking if cluster CIDRs overlap 
 ✗ CIDR ["172.30.0.0/16" "sgaddam-aws-spoke2" "sgaddam-aws-spoke1" ["172.30.0.0/16" "10.128.0.0/14"]] in cluster %!q(MISSING) overlaps with cluster %!q(MISSING) (CIDRs: %!v(MISSING))
 ✗ CIDR ["10.128.0.0/14" "sgaddam-aws-spoke2" "sgaddam-aws-spoke1" ["172.30.0.0/16" "10.128.0.0/14"]] in cluster %!q(MISSING) overlaps with cluster %!q(MISSING) (CIDRs: %!v(MISSING))

 ✓ Checking Submariner support for the kube-proxy mode 
 ✓ The kube-proxy mode is supported

 ✓ Checking the firewall configuration to determine if the metrics port (8080) is allowed 
 ✓ The firewall configuration allows metrics to be retrieved from Gateway nodes

 ✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed 
 ✓ This check is not necessary for the OVNKubernetes CNI plugin
 ✓ The firewall configuration allows intra-cluster VXLAN traffic

 ✓ Globalnet is not installed - skipping

Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.
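
A side note on the two CIDR overlap lines in the output above: the `%!q(MISSING)` and `%!v(MISSING)` markers are what Go's fmt package prints when a format string has more verbs than arguments. The output is consistent with an argument slice being passed as a single value instead of being expanded with `...`. The sketch below reproduces that pattern; the format string and the function names are assumptions made for illustration and are not taken from the subctl source.

```go
package main

import "fmt"

// overlapFormat is a hypothetical format string chosen to mirror the
// diagnose output; it is not copied from subctl. It is a variable rather
// than a constant so the example compiles cleanly under "go vet".
var overlapFormat = "CIDR %q in cluster %q overlaps with cluster %q (CIDRs: %v)"

// formatSliceAsOneArg passes the whole slice as one argument. The first %q
// consumes the entire slice and the remaining verbs have nothing to print.
func formatSliceAsOneArg(args []interface{}) string {
	return fmt.Sprintf(overlapFormat, args)
}

// formatExpanded spreads the slice with "...", so every verb receives its
// own argument.
func formatExpanded(args []interface{}) string {
	return fmt.Sprintf(overlapFormat, args...)
}

func main() {
	args := []interface{}{
		"172.30.0.0/16",
		"sgaddam-aws-spoke2",
		"sgaddam-aws-spoke1",
		[]string{"172.30.0.0/16", "10.128.0.0/14"},
	}

	fmt.Println(formatSliceAsOneArg(args)) // contains %!q(MISSING) markers
	fmt.Println(formatExpanded(args))      // readable message
}
```

If this is indeed the cause, the check itself is working and only the message formatting is affected.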

After joining two clusters with overlapping CIDRs in an OVNK deployment, I can see that both the route-agent pod and the network-plugin-syncer pod of Submariner check for overlapping CIDRs and ignore the remote endpoint when they detect an overlap.

Logs from submariner-networkplugin-syncer pod

 I0621 12:43:18.570041       1 handler.go:71] A new Endpoint for remote cluster "sgaddam-aws-spoke2" has been created: v1.EndpointSpec{ClusterID:"sgaddam-aws-spoke2", CableName:"submariner-cable-sgaddam-aws-spoke2-22-0-37-219", HealthCheckIP:"10.129.2.2", Hostname:"ip-22-0-37-219", Subnets:[]string{"172.30.0.0/16", "10.128.0.0/14"}, PrivateIP:"22.0.37.219", PublicIP:"13.59.92.39", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "udp-port":"4500"}}
E0621 12:43:18.570083       1 handler.go:132] overlappingSubnets for new remote &v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"sgaddam-aws-spoke2-submariner-cable-sgaddam-aws-spoke2-22-0-37-219", GenerateName:"", Namespace:"submariner-operator", SelfLink:"", UID:"2bd50d4e-cd3a-4f85-95bb-c8df2c89c1c8", ResourceVersion:"54258", Generation:1, CreationTimestamp:time.Date(2022, time.June, 21, 12, 43, 18, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"submariner-io/clusterID":"sgaddam-aws-spoke2"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-gateway", Operation:"Update", APIVersion:"submariner.io/v1", Time:time.Date(2022, time.June, 21, 12, 43, 18, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0005b4d80), Subresource:""}}}, Spec:v1.EndpointSpec{ClusterID:"sgaddam-aws-spoke2", CableName:"submariner-cable-sgaddam-aws-spoke2-22-0-37-219", HealthCheckIP:"10.129.2.2", Hostname:"ip-22-0-37-219", Subnets:[]string{"172.30.0.0/16", "10.128.0.0/14"}, PrivateIP:"22.0.37.219", PublicIP:"13.59.92.39", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "udp-port":"4500"}}} returned error: local Service CIDR "172.30.0.0/16", overlaps with remote cluster subnets [172.30.0.0/16 10.128.0.0/14]

Logs from submariner-routeagent pod

I0621 12:43:18.569409       1 handler.go:71] A new Endpoint for remote cluster "sgaddam-aws-spoke2" has been created: v1.EndpointSpec{ClusterID:"sgaddam-aws-spoke2", CableName:"submariner-cable-sgaddam-aws-spoke2-22-0-37-219", HealthCheckIP:"10.129.2.2", Hostname:"ip-22-0-37-219", Subnets:[]string{"172.30.0.0/16", "10.128.0.0/14"}, PrivateIP:"22.0.37.219", PublicIP:"13.59.92.39", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "udp-port":"4500"}}
E0621 12:43:18.569436       1 handler.go:109] overlappingSubnets for new remote &v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"sgaddam-aws-spoke2-submariner-cable-sgaddam-aws-spoke2-22-0-37-219", GenerateName:"", Namespace:"submariner-operator", SelfLink:"", UID:"2bd50d4e-cd3a-4f85-95bb-c8df2c89c1c8", ResourceVersion:"54258", Generation:1, CreationTimestamp:time.Date(2022, time.June, 21, 12, 43, 18, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"submariner-io/clusterID":"sgaddam-aws-spoke2"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-gateway", Operation:"Update", APIVersion:"submariner.io/v1", Time:time.Date(2022, time.June, 21, 12, 43, 18, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc000f48438), Subresource:""}}}, Spec:v1.EndpointSpec{ClusterID:"sgaddam-aws-spoke2", CableName:"submariner-cable-sgaddam-aws-spoke2-22-0-37-219", HealthCheckIP:"10.129.2.2", Hostname:"ip-22-0-37-219", Subnets:[]string{"172.30.0.0/16", "10.128.0.0/14"}, PrivateIP:"22.0.37.219", PublicIP:"13.59.92.39", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "udp-port":"4500"}}} returned error: local Service CIDR "172.30.0.0/16", overlaps with remote cluster subnets [172.30.0.0/16 10.128.0.0/14]
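
The kind of check reported in the logs above can be sketched with Go's standard `net/netip` package. This is only an illustration of comparing local and remote subnets, assuming the CIDRs are available as strings; `firstOverlap` is a hypothetical helper and not the route-agent or network-plugin-syncer code.

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstOverlap returns the first local/remote CIDR pair that overlaps.
// The boolean is false when no pair overlaps. (Hypothetical helper.)
func firstOverlap(local, remote []string) (string, string, bool, error) {
	for _, l := range local {
		lp, err := netip.ParsePrefix(l)
		if err != nil {
			return "", "", false, fmt.Errorf("parsing local CIDR %q: %w", l, err)
		}
		for _, r := range remote {
			rp, err := netip.ParsePrefix(r)
			if err != nil {
				return "", "", false, fmt.Errorf("parsing remote CIDR %q: %w", r, err)
			}
			// Overlaps reports whether the two prefixes share any address.
			if lp.Overlaps(rp) {
				return l, r, true, nil
			}
		}
	}
	return "", "", false, nil
}

func main() {
	// Subnets taken from the Endpoint in the logs; both clusters use the same values.
	local := []string{"172.30.0.0/16", "10.128.0.0/14"}
	remote := []string{"172.30.0.0/16", "10.128.0.0/14"}

	l, r, found, err := firstOverlap(local, remote)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if found {
		fmt.Printf("local CIDR %q overlaps with remote CIDR %q\n", l, r)
		return
	}
	fmt.Println("no overlapping CIDRs")
}
```

A caller would typically skip programming routes for the remote endpoint when the helper reports an overlap, which matches the behavior described above.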

I then ran subctl uninstall ... on both clusters, and it completed without any errors.

[sgaddam@localhost aws-ocp]$ subctl uninstall -y
 ✓ Checking if the connectivity component is installed on cluster "sgaddam-aws-spoke2" 
 ✓ The connectivity component is installed on cluster "sgaddam-aws-spoke2"
 ✓ Deleting the Submariner resource - this may take some time 
 ✓ Checking if the broker component is installed on cluster "sgaddam-aws-spoke2" 
 ✓ The broker component is not installed on cluster "sgaddam-aws-spoke2"
 ✓ Deleting the Submariner cluster roles and bindings on cluster "sgaddam-aws-spoke2" 
 ✓ Deleted the "submariner-diagnose" cluster role and binding
 ✓ Deleted the "submariner-gateway" cluster role and binding
 ✓ Deleted the "submariner-globalnet" cluster role and binding
 ✓ Deleted the "submariner-lighthouse-agent" cluster role and binding
 ✓ Deleted the "submariner-lighthouse-coredns" cluster role and binding
 ✓ Deleted the "submariner-networkplugin-syncer" cluster role and binding
 ✓ Deleted the "submariner-operator" cluster role and binding
 ✓ Deleted the "submariner-routeagent" cluster role and binding
 ✓ Deleting the Submariner namespace "submariner-operator" on cluster "sgaddam-aws-spoke2" 
 ✓ Deleting the Submariner custom resource definitions on cluster "sgaddam-aws-spoke2" 
 ✓ Deleted the "brokers.submariner.io" custom resource definition
 ✓ Deleted the "clusterglobalegressips.submariner.io" custom resource definition
 ✓ Deleted the "clusters.submariner.io" custom resource definition
 ✓ Deleted the "endpoints.submariner.io" custom resource definition
 ✓ Deleted the "gateways.submariner.io" custom resource definition
 ✓ Deleted the "globalegressips.submariner.io" custom resource definition
 ✓ Deleted the "globalingressips.submariner.io" custom resource definition
 ✓ Deleted the "servicediscoveries.submariner.io" custom resource definition
 ✓ Deleted the "submariners.submariner.io" custom resource definition
 ✓ Unlabeling gateway nodes on cluster "sgaddam-aws-spoke2"

After Submariner was uninstalled on both clusters, I deployed an nginx service and a client pod. The client pod was able to reach the nginx service, which confirms that the cluster works fine even after uninstalling Submariner.

@josecastillolema, for the following

What you expected to happen:

One thing I noticed is that cableEngine currently does not validate overlapping CIDRs and goes ahead and sets up tunnels to the remote cluster. Although this is not causing an issue, it might be a good idea to add the check to the cableEngine code. As this is not specific to OVN, we can open a separate issue for it and fix it there.

sridhargaddam commented 2 years ago

@josecastillolema can I go ahead and close this issue as the necessary checks are now in place?

josecastillolema commented 2 years ago

Sure, great job.

nyechiel commented 2 years ago

Thanks @sridhargaddam :100: