submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

[help] I'm having some problems deploying Submariner in a private environment #1649

Closed ultrastdn closed 2 years ago

ultrastdn commented 2 years ago

I have two clusters, cluster-a and cluster-b, on two VMs whose IP addresses can ping each other. The broker is deployed on cluster-a. Both cluster-a and cluster-b have joined the broker (using the --natt=false parameter). After the submariner-gateway starts successfully, the following appears in the gateway log:

0105 06:48:08.979928       1 tunnel.go:74] Tunnel controller processing added or updated submariner Endpoint object: &v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cluster-a-submariner-cable-cluster-a-7-212-53-18", GenerateName:"", Namespace:"submariner-operator", SelfLink:"/apis/submariner.io/v1/namespaces/submariner-operator/endpoints/cluster-a-submariner-cable-cluster-a-7-212-53-18", UID:"e18d84d3-1d8c-48e9-bda5-51b1c0271609", ResourceVersion:"7665", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63776539920, loc:(*time.Location)(0x21f0940)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-gateway", Operation:"Update", APIVersion:"submariner.io/v1", Time:(*v1.Time)(0xc000017158), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc000017170)}}}, Spec:v1.EndpointSpec{ClusterID:"cluster-a", CableName:"submariner-cable-cluster-a-7-212-53-18", HealthCheckIP:"10.240.0.0", Hostname:"kwephisprm19248", Subnets:[]string{"10.1.0.0/16", "10.240.0.0/16"}, PrivateIP:"7.212.53.18", PublicIP:"7.212.67.110", NATEnabled:false, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "public-ip":"ipv4:7.212.67.110", "udp-port":"4500"}}}
I0105 06:48:08.979998       1 cableengine.go:206] Not installing cable for local cluster
I0105 06:48:08.980060       1 tunnel.go:74] Tunnel controller processing added or updated submariner Endpoint object: &v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cluster-b-submariner-cable-cluster-b-7-212-62-19", GenerateName:"", Namespace:"submariner-operator", SelfLink:"/apis/submariner.io/v1/namespaces/submariner-operator/endpoints/cluster-b-submariner-cable-cluster-b-7-212-62-19", UID:"0fc2faa6-92e7-4a16-b249-22219af4ce14", ResourceVersion:"8365", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63776540177, loc:(*time.Location)(0x21f0940)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"submariner-io/clusterID":"cluster-b"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-gateway", Operation:"Update", APIVersion:"submariner.io/v1", Time:(*v1.Time)(0xc0000174d0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0000174e8)}}}, Spec:v1.EndpointSpec{ClusterID:"cluster-b", CableName:"submariner-cable-cluster-b-7-212-62-19", HealthCheckIP:"10.244.0.0", Hostname:"kwephisprm19247", Subnets:[]string{"10.10.0.0/16", "10.244.0.0/16"}, PrivateIP:"7.212.62.19", PublicIP:"7.212.62.19", NATEnabled:false, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "public-ip":"ipv4:7.212.62.19", "udp-port":"4500"}}}
I0105 06:48:08.980114       1 cableengine.go:158] Found a pre-existing cable "submariner-cable-cluster-b-7-212-62-19" with timestamp "2021-12-31 09:36:17 +0000 UTC" that belongs to this cluster cluster-b
I0105 06:48:08.980141       1 cableengine.go:172] Connection info (IP: 7.212.62.19, NAT: false, BackendConfig: map[natt-discovery-port:4490 preferred-server:false public-ip:ipv4:7.212.62.19 udp-port:4500]) for cable "submariner-cable-cluster-b-7-212-62-19" is unchanged - not re-installing
I0105 06:48:13.489752       1 libreswan.go:215] Connection "submariner-cable-cluster-b-7-212-62-19-0-0" not found in active connections obtained from whack: map[], map[]
I0105 06:48:13.489777       1 libreswan.go:215] Connection "submariner-cable-cluster-b-7-212-62-19-0-1" not found in active connections obtained from whack: map[], map[]
I0105 06:48:13.489784       1 libreswan.go:215] Connection "submariner-cable-cluster-b-7-212-62-19-1-0" not found in active connections obtained from whack: map[], map[]
I0105 06:48:13.489791       1 libreswan.go:215] Connection "submariner-cable-cluster-b-7-212-62-19-1-1" not found in active connections obtained from whack: map[], map[]
I0105 06:48:13.489813       1 libreswan.go:229] Connection "submariner-cable-cluster-b-7-212-62-19" not found in active connections obtained from whack: map[], map[]

How do I check and fix the problem? Here is what subctl show all reports: [image]

sridhargaddam commented 2 years ago

Normally the endpoint IP and the public IP are different. In your case, I see the following:

  1. For cluster-b, the two are identical.
  2. For cluster-a, they differ, but both appear to be public IPs.
  3. Since the subctl gather ... output is missing and the subctl show all output is from a single cluster, I can't tell whether you are using non-overlapping CIDRs.
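The CIDR question in item 3 can be checked offline. A minimal sketch, using the Subnets values from the two Endpoint objects in the gateway log above; find_overlaps is a hypothetical helper written for this illustration and is not part of subctl.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs_by_cluster):
    """Return (cluster, cidr, cluster, cidr) tuples for overlapping CIDRs across clusters."""
    entries = [
        (cluster, ipaddress.ip_network(cidr))
        for cluster, cidrs in cidrs_by_cluster.items()
        for cidr in cidrs
    ]
    overlaps = []
    for (c1, n1), (c2, n2) in combinations(entries, 2):
        # Only pairs that belong to different clusters matter here.
        if c1 != c2 and n1.overlaps(n2):
            overlaps.append((c1, str(n1), c2, str(n2)))
    return overlaps


# Subnets copied from the Endpoint objects in the gateway log.
clusters = {
    "cluster-a": ["10.1.0.0/16", "10.240.0.0/16"],
    "cluster-b": ["10.10.0.0/16", "10.244.0.0/16"],
}
print(find_overlaps(clusters))  # [] means no overlap between the clusters
```

An empty list only covers the subnets that appear in the log; the full subctl gather output would still be needed to confirm the rest of the configuration.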

ultrastdn commented 2 years ago

I'm sure the CIDRs I'm using are non-overlapping. Currently I only have one NIC on my device, so I'm not sure if it's correct to set public-ip and EndpointIP to the same IP. Or can you tell me what this public-ip is for? Thank you.

sridhargaddam commented 2 years ago

You don't have to set the public IP on the NIC. When a cluster advertises its local endpoint to the broker (and subsequently to the remote clusters), the endpoint info contains both the private IP and the public IP. The Submariner gateway running on the remote cluster will try to connect to both the private IP and the public IP and will choose the one that is reachable. It uses the auto-NAT discovery mechanism explained here: https://submariner.io/operations/nat-traversal/
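The selection described above can be pictured roughly as follows. This is a sketch under stated assumptions, not Submariner's implementation: choose_remote_ip and the is_reachable probe are hypothetical names, the probe is a stub rather than a real network check, and trying the private IP first is an assumption made for the example.

```python
def choose_remote_ip(private_ip, public_ip, is_reachable):
    """Return the first candidate address that answers the probe, or None."""
    # Assumption for this sketch: the private IP is tried first.
    for ip in (private_ip, public_ip):
        if is_reachable(ip):
            return ip
    return None


# Stub probe standing in for a real network check: only cluster-a's
# private IP (taken from the Endpoint log above) is treated as reachable.
reachable = {"7.212.53.18"}
print(choose_remote_ip("7.212.53.18", "7.212.67.110", reachable.__contains__))
```

With a stub in which neither address answers, the function returns None, which corresponds to the situation where no tunnel can be established.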

Submariner uses the following mechanism to discover its public IP: https://github.com/submariner-io/submariner/blob/40f060ce1bc0efc8e726b1ed84f53c49a68c0985/pkg/util/util.go#L76

Please check why the endpoints are created without private IPs (I'm assuming that 7.212.62.x is a public IP).
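One likely reason 7.212.62.x reads as a public IP is that it falls outside the RFC 1918 private ranges. A small check with the addresses from the Endpoint log; in_rfc1918 is a hypothetical helper for this illustration. An address outside these ranges can still be used internally on a private network, so the result only shows how the address looks from the outside.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def in_rfc1918(ip):
    """Return True if ip belongs to one of the RFC 1918 private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)


# PrivateIP values and one HealthCheckIP value from the gateway log.
for ip in ("7.212.53.18", "7.212.62.19", "10.240.0.0"):
    print(ip, in_rfc1918(ip))
# 7.212.53.18 False
# 7.212.62.19 False
# 10.240.0.0 True
```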

ultrastdn commented 2 years ago

I ran the kubectl annotate node kwephisprm19248 gateway.submariner.io/public-ip- command to remove the public IP annotation, then ran the subctl join --kubeconfig kubeconfig.cluster-b broker-info.subm --clusterid cluster-b --natt=false --force-udp-encaps command to join cluster-a and cluster-b to the broker again. After joining the broker, the gateway repeatedly crashes with the following error messages:

E0106 06:33:10.090003 1 public_ip.go:80] Error resolving public IP with resolver api:api.ipify.org: retrieving public IP from https://api.ipify.org: Get "https://api.ipify.org": dial tcp 54.91.59.199:443: i/o timeout.
E0106 06:33:40.090557 1 public_ip.go:80] Error resolving public IP with resolver api:api.my-ip.io/ip: retrieving public IP from https://api.my-ip.io/ip: Get "https://api.my-ip.io/ip": dial tcp 161.35.189.70:443: i/o timeout
E0106 06:34:10.090773 1 public_ip.go:80] Error resolving public IP with resolver api:ip4.seeip.org: retrieving public IP from https://ip4.seeip.org: Get "https://ip4.seeip.org": dial tcp 23.128.64.141:443: i/o timeout
F0106 06:34:10.090872 1 main.go:134] Error creating local endpoint object from types.SubmarinerSpecification{ClusterCidr:[]string{"10.244.0.0/16"}, ColorCodes:[]string{"blue"}, GlobalCidr:[]string{}, ServiceCidr:[]string{"10.10.0.0/16"}, Broker:"k8s", CableDriver:"libreswan", ClusterID:"cluster-b", Namespace:"submariner-operator", PublicIP:"", Token:"", Debug:false, NATEnabled:false, HealthCheckEnabled:true, HealthCheckInterval:0x1, HealthCheckMaxPacketLossCount:0x5}: could not determine public IP: Unable to resolve public IP by any of the resolver methods: [api:api.ipify.org api:api.my-ip.io/ip api:ip4.seeip.org]

Previously, I was able to solve the problem by specifying public-ip, but it does not seem to be correct. What is the correct solution?

sridhargaddam commented 2 years ago

Since your environment is private, without internet access, public IP resolution fails and the submariner-gateway pod crashes. I will check whether it's safe to skip public IP resolution when NAT is disabled and will push a PR (this might take some time). In the meantime, you can add the gateway.submariner.io/public-ip annotation with some arbitrary public IP (connections to it will fail anyway), and auto-NAT discovery in submariner-gateway will use the private IP. Alternatively, you can set the private IP of the gateway host as the gateway.submariner.io/public-ip annotation on the respective gateway node.

Currently, the issue is that the private IPs in the local endpoints are addresses like 7.212.62.x (which look like public IPs) rather than private IPs. The local endpoint should be created with a private IP that is directly reachable from the gateway node of the remote cluster.

Also, include the subctl gather ... output for further analysis.
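The crash and the annotation workaround can be sketched as a resolver fallback loop. This is an illustration based only on what the logs show, not Submariner's code: resolve_public_ip and fetch are hypothetical names, and the "ipv4:" annotation format is an assumption taken from the "public-ip":"ipv4:..." entry in the logged BackendConfig.

```python
def resolve_public_ip(annotation, resolvers, fetch):
    """Return the gateway's public IP.

    annotation: value of gateway.submariner.io/public-ip, or None if unset.
    resolvers:  ordered entries such as "api:api.ipify.org".
    fetch:      callable(url) -> str that raises OSError on network failure.
    """
    # Assumed format: an explicit "ipv4:<address>" value needs no network access.
    if annotation and annotation.startswith("ipv4:"):
        return annotation[len("ipv4:"):]

    errors = []
    for entry in resolvers:
        kind, _, target = entry.partition(":")
        if kind != "api":
            errors.append(f"{entry}: not handled in this sketch")
            continue
        try:
            return fetch(f"https://{target}").strip()
        except OSError as exc:
            errors.append(f"{entry}: {exc}")
    raise RuntimeError("could not determine public IP: " + "; ".join(errors))


# Resolver list as it appears in the fatal log message.
RESOLVERS = ["api:api.ipify.org", "api:api.my-ip.io/ip", "api:ip4.seeip.org"]


def offline_fetch(url):
    """Stub for an air-gapped environment: every request times out."""
    raise TimeoutError("i/o timeout")


# With the annotation set, no resolver is contacted.
print(resolve_public_ip("ipv4:7.212.62.19", RESOLVERS, offline_fetch))

# Without it, every resolver fails and the error is raised.
try:
    resolve_public_ip(None, RESOLVERS, offline_fetch)
except RuntimeError as exc:
    print(exc)
```

In this sketch the annotation short-circuits the lookup entirely, which matches the reporter's observation that specifying a public IP stopped the crash.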

ultrastdn commented 2 years ago

  1. I confirm that 7.212.xx.xx is the private IP address of my host. I ran the IP detection code from the source to check: [image] The address it returns is 7.212.xx.xx.
  2. You suggest assigning the host's private IP address as the public IP. However, when I do this, the endpoint IP is the same as the public IP, and the tunnel is reported as failing to establish.
  3. If my pod could access the public network and reach api:api.ipify.org, api:api.my-ip.io/ip, and api:ip4.seeip.org successfully, would these problems be solved? If the problem persists, I will provide the logs generated by subctl gather ....

dfarrell07 commented 2 years ago

Is this still relevant? Note the air-gapped discussion here: https://github.com/submariner-io/submariner/issues/1790

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.