IceManGreen opened this issue 6 months ago
@IceManGreen we would need some additional info to narrow down the problem. Can you clarify the following?
- What is the CNI that is used in your clusters?
- What is the Submariner version?
- Can you run subctl verify --context <kubeContext1> --tocontext <kubeContext2> --only connectivity --verbose? This will help us to know whether connectivity works between the Gateway nodes but fails when the client/server is on a non-Gateway node. Note: since subctl show connections is looking good, I think Gateway-to-Gateway communication is probably working in your setup, but it fails when one of the pods is on a non-Gateway node.
- Please attach the tar of subctl gather from the domain-2 and domain-3 clusters.

@sridhargaddam hello! Thanks for your answer.
What is the CNI that is used in your clusters?
I use K3S with Flannel :
$ /var/lib/rancher/k3s/data/current/bin/flannel
CNI Plugin flannel version v0.22.2 (linux/amd64) commit HEAD built on 2024-02-06T01:58:54Z
What is the Submariner version?
$ subctl show versions
Cluster "domain-2"
✓ Showing versions
COMPONENT REPOSITORY CONFIGURED RUNNING ARCH
submariner-gateway quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-routeagent quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-globalnet quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-metrics-proxy quay.io/submariner 0.17.0 release-0.17-81b7e55f5306 amd64
submariner-operator quay.io/submariner 0.17.0 release-0.17-d750fbdcb610 amd64
submariner-lighthouse-agent quay.io/submariner 0.17.0 release-0.17-7ad4dd387b0b amd64
submariner-lighthouse-coredns quay.io/submariner 0.17.0 release-0.17-7ad4dd387b0b amd64
Cluster "domain-3"
✓ Showing versions
COMPONENT REPOSITORY CONFIGURED RUNNING ARCH
submariner-gateway quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-routeagent quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-globalnet quay.io/submariner 0.17.0 release-0.17-72c0e6dd56c8 amd64
submariner-metrics-proxy quay.io/submariner 0.17.0 release-0.17-81b7e55f5306 amd64
submariner-operator quay.io/submariner 0.17.0 release-0.17-d750fbdcb610 amd64
submariner-lighthouse-agent quay.io/submariner 0.17.0 release-0.17-7ad4dd387b0b amd64
submariner-lighthouse-coredns quay.io/submariner 0.17.0 release-0.17-7ad4dd387b0b amd64
Cluster "e2e-mgmt"
✓ Showing versions
COMPONENT REPOSITORY CONFIGURED RUNNING ARCH
submariner-operator quay.io/submariner 0.17.0 release-0.17-d750fbdcb610 amd64
Also, can you run subctl verify --context <kubeContext1> --tocontext <kubeContext2> --only connectivity --verbose ?
It seems that I have failing tests on domain-2 and domain-3 because the pods that must run the tests cannot be deployed.
The events mention :
"message": "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.."
What affinity/selector should the nodes have so that the test pods can be deployed?
See attachment for the entire file.
Please attach the tar of subctl gather from domain-2 and domain-3 clusters.
See attachment (sorry github only supports zip files, not tar).
Also, did you check if your use-case (i.e., curl to remote cluster service) is working fine when both control-plane and data-plane were using the same interface?
To verify this, I installed HTTP servers on every domain listening on:
- enp1s0 on port 8080
- enp2s0 on port 8081
Tests for the control plane (enp1s0) and the data plane (enp2s0):
# from domain-3 to domain-2
# control-plane
curl 172.16.100.81:8080 -sSLI
HTTP/1.0 200 OK
# data plane
curl 172.16.110.81:8081 -sSLI
HTTP/1.0 200 OK
# from domain-2 to domain-3
# control plane
curl 172.16.100.84:8080 -sSLI
HTTP/1.0 200 OK
# data plane
curl 172.16.110.84:8081 -sSLI
HTTP/1.0 200 OK
Everything works fine for simple curl commands.
What affinity/selector should the nodes have so that the test pods can be deployed?
In subctl verify we deploy connectivity test pods on the GW node for some tests and on non-GW nodes for other tests; we use a kubernetes.io/hostname NodeSelector for that purpose.
Could you label (submariner.io/gateway=true) only a single node as GW and rerun subctl verify?
Hello @yboaron,
Do you mean that if more than one node is labelled with submariner.io/gateway=true, the tests cannot run?
For now, indeed, I have all of the nodes of my clusters labelled with it:
$ kubectl label --list nodes --all --context domain-2 | grep submariner
submariner.io/gateway=true
submariner.io/gateway=true
submariner.io/gateway=true
$ kubectl label --list nodes --all --context domain-3 | grep submariner
submariner.io/gateway=true
submariner.io/gateway=true
submariner.io/gateway=true
Since you labelled all 3 nodes as GWs on both clusters, tests that need to run one of the test pods on a non-GW node will fail [1], while tests that run both the client pod and the listener pod on a GW will succeed [2].
Can you label (submariner.io/gateway=true) only a single node as GW on each cluster and rerun subctl verify?
[1]
Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/submariner@v0.17.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:53
[FAILED] Failed to await pod ready. Pod "tcp-check-listenermxz66" is still pending: status: { "phase": "Pending", "conditions": [ { "type": "PodScheduled", "status": "False", "lastProbeTime": null, "lastTransitionTime": "2024-03-11T09:05:50Z", "reason": "Unschedulable", "message": "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.." } ], "qosClass": "BestEffort" }
[2]
Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote service when the pod is on a gateway and the remote service is on a gateway should have sent the expected data from the pod to the other pod [dataplane, globalnet, basic]
github.com/submariner-io/submariner@v0.17.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:53
STEP: Creating namespace objects with basename "dataplane-gn-conn-nd" @ 03/11/24 09:09:51.484
STEP: Generated namespace "e2e-tests-dataplane-gn-conn-nd-zz54k" in cluster "domain-2" to execute the tests in @ 03/11/24 09:09:51.737
STEP: Creating namespace "e2e-tests-dataplane-gn-conn-nd-zz54k" in cluster "domain-3" @ 03/11/24 09:09:51.737
STEP: Creating a listener pod in cluster "domain-3", which will wait for a handshake over TCP @ 03/11/24 09:09:52.507
STEP: Pointing a ClusterIP service to the listener pod in cluster "domain-3" @ 03/11/24 09:10:04.21
STEP: Creating a connector pod in cluster "domain-2", which will attempt the specific UUID handshake over TCP @ 03/11/24 09:10:07.866
Mar 11 09:10:21.482: INFO: ExecWithOptions &{Command:[sh -c for j in $(seq 1 50); do echo [dataplane] connector says 67dbefdb-e262-4c91-a4bd-ac2c20eb151a; done | for i in $(seq 2); do if nc -v 242.1.255.253 1234 -w 60; then break; else sleep 30; fi; done] Namespace:e2e-tests-dataplane-gn-conn-nd-zz54k PodName:customl55cl ContainerName:connector-pod Stdin:
STEP: Verifying that the listener got the connector's data and the connector got the listener's data @ 03/11/24 09:10:31.115
STEP: Verifying the output of the listener pod contains a cluster-scoped global IP [242.0.0.1 242.0.0.2 242.0.0.3 242.0.0.4 242.0.0.5 242.0.0.6 242.0.0.7 242.0.0.8] of the connector Pod @ 03/11/24 09:10:31.115
STEP: Deleting service "test-svc-tcp-check-listener" on "domain-3" @ 03/11/24 09:10:31.873
STEP: Deleting namespace "e2e-tests-dataplane-gn-conn-nd-zz54k" on cluster "domain-2" @ 03/11/24 09:10:32.83
STEP: Deleting namespace "e2e-tests-dataplane-gn-conn-nd-zz54k" on cluster "domain-3" @ 03/11/24 09:10:33.62
• [43.343 seconds]
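The relabeling requested above might look like the following sketch. Here harry is the gateway node name reported for domain-2 elsewhere in this thread, while ron is a hypothetical second node; the trailing dash in kubectl label removes a label.

```shell
# Keep the gateway label on exactly one node per cluster...
kubectl label node harry "submariner.io/gateway=true" --overwrite --context domain-2
# ...and remove it from the others (trailing "-" deletes the label).
kubectl label node ron "submariner.io/gateway-" --context domain-2
```

Repeat the same on domain-3 with its own node names, then rerun subctl verify.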
Ok so I applied the label "submariner.io/gateway=true" on only one node for domain-2 and domain-3 clusters.
But now I am confused because the domain-3 connection is down:
subctl show connections
Cluster "e2e-mgmt"
⚠ Submariner connectivity feature is not installed
Cluster "domain-2"
✓ Showing Connections
GATEWAY CLUSTER REMOTE IP NAT CABLE DRIVER SUBNETS STATUS RTT avg.
harry domain-3 172.16.100.84 no libreswan 242.1.0.0/16 connecting 0s
Cluster "domain-3"
✗ Showing Connections
✗ No connections found
So I reinstalled Submariner on domain-3 but I got the same problem.
subctl uninstall --context domain-3
subctl join broker-info.subm --clustercidr "10.42.0.0/16" --globalnet --clusterid domain-3 --context domain-3
Every pod seems fine :
kubectl get pods -n submariner-operator --context domain-3
NAME READY STATUS RESTARTS AGE
submariner-gateway-qhq6s 1/1 Running 0 25m
submariner-globalnet-b46h6 1/1 Running 0 25m
submariner-lighthouse-agent-749f576cd9-t87fw 1/1 Running 0 25m
submariner-lighthouse-coredns-86b594f7cd-fptx7 1/1 Running 0 25m
submariner-lighthouse-coredns-86b594f7cd-qgz7m 1/1 Running 0 25m
submariner-metrics-proxy-qfjlv 2/2 Running 0 25m
submariner-operator-7994fc86c5-w95w8 1/1 Running 0 26m
submariner-routeagent-6jv4v 1/1 Running 0 25m
submariner-routeagent-bkbsk 1/1 Running 0 25m
submariner-routeagent-dj7c5 1/1 Running 0 25m
But the gateway is showing an error in its logs (kubectl logs submariner-gateway-qhq6s -n submariner-operator --context domain-3):
# ...
2024-03-11T15:22:20.397Z ERR ..gine/cableengine.go:147 CableEngine Error installing cable for &natdiscovery.NATEndpointInfo{Endpoint:v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"domain-2-submariner-cable-domain-2-172-16-100-81", GenerateName:"", Namespace:"submariner-operator", SelfLink:"", UID:"57c8b6c2-6bdb-411b-8eb9-dcf6282515a1", ResourceVersion:"3416393", Generation:1, CreationTimestamp:time.Date(2024, time.March, 11, 14, 55, 19, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"submariner-io/clusterID":"domain-2"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-gateway", Operation:"Update", APIVersion:"submariner.io/v1", Time:time.Date(2024, time.March, 11, 14, 55, 19, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc000204558), Subresource:""}}}, Spec:v1.EndpointSpec{ClusterID:"domain-2", CableName:"submariner-cable-domain-2-172-16-100-81", HealthCheckIP:"242.0.255.254", Hostname:"porthos", Subnets:[]string{"242.0.0.0/16"}, PrivateIP:"172.16.100.81", PublicIP:"172.16.110.81", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string{"natt-discovery-port":"4490", "preferred-server":"false", "public-ip":"ipv4:172.16.110.81", "udp-port":"4500"}}}, UseNAT:false, UseIP:"172.16.100.81"} error="error installing Endpoint cable \"submariner-cable-domain-2-172-16-100-81\": error whacking with args [--psk --encrypt --name submariner-cable-domain-2-172-16-100-81-0-0 --id 172.16.100.84 --host 172.16.100.84 --client 242.1.0.0/16 --ikeport 4500 --to --id 172.16.100.81 --host 172.16.100.81 --client 242.0.0.0/16 --ikeport 4500 --dpdaction=hold --dpddelay 30]: exit status 20"
Could you please reinstall Submariner on both clusters and, in case you still hit the connection issue, share subctl gather from both clusters.
BTW, I can see that Submariner detects the CNI as generic and not flannel. Do you have a daemonset named flannel in the kube-system NS? Do you have a volume named flannel in this daemonset?
Hello @yboaron
Sorry for the late answer! I tested multiple things:
- I reinstalled the domain-3 cluster (K3S) and Submariner on it -> not successful
- subctl join broker-info.subm --clustercidr "10.42.0.0/16" --globalnet --cable-driver wireguard -> connections are successful but not the tests:

$ subctl show connections
Cluster "domain-3"
✓ Showing Connections
GATEWAY CLUSTER REMOTE IP NAT CABLE DRIVER SUBNETS STATUS RTT avg.
porthos domain-2 172.16.100.81 no wireguard 242.0.0.0/16 connected 1.430896ms
Cluster "e2e-mgmt"
⚠ Submariner connectivity feature is not installed
Cluster "domain-2"
✓ Showing Connections
GATEWAY CLUSTER REMOTE IP NAT CABLE DRIVER SUBNETS STATUS RTT avg.
harry domain-3 172.16.100.84 no wireguard 242.1.0.0/16 connected 1.006687ms
But in the end, the tests were not successful :
Summarizing 10 Failures:
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote
service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to
the other pod [dataplane, globalnet, basic]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote
service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the
other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod matching an egress IP namespace selector conn
ects via TCP to the globalIP of a remote service when the pod is not on a gateway and the remote service is not on a gateway [It] should hav
e sent the expected data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod matching an egress IP namespace selector conn
ects via TCP to the globalIP of a remote service when the pod is on a gateway and the remote service is on a gateway [It] should have sent t
he expected data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/submariner@v0.17.0/test/e2e/framework/dataplane.go:200
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod matching an egress IP pod selector connects v
ia TCP to the globalIP of a remote service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent
the expected data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod matching an egress IP pod selector connects v
ia TCP to the globalIP of a remote service when the pod is on a gateway and the remote service is on a gateway [It] should have sent the exp
ected data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/submariner@v0.17.0/test/e2e/framework/dataplane.go:200
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod with HostNetworking connects via TCP to the g
lobalIP of a remote service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected d
ata from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod with HostNetworking connects via TCP to the g
lobalIP of a remote service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data
from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod with HostNetworking connects via TCP to the g
lobalIP of a remote service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data
from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote
headless service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected data from th
e pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
[FAIL] Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote
service in reverse direction when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected
data from the pod to the other pod [dataplane, globalnet]
github.com/submariner-io/shipyard@v0.17.0/test/e2e/framework/network_pods.go:196
Ran 13 of 47 Specs in 1888.586 seconds
FAIL! -- 3 Passed | 10 Failed | 0 Pending | 34 Skipped
Here is the complete file : verify-202403121045.log
I think there is something important to note here: K3S supports different backends for Flannel, so I configured it with WireGuard to encrypt the inter-node communications.
I suppose that Submariner must then adapt its cable driver to wireguard.
The main concern I have is that K3S will deprecate the support of IPSec for its flannel CNI :
K3s no longer includes strongSwan swanctl and charon binaries starting with the 2022-12 releases (v1.26.0+k3s1, v1.25.5+k3s1, v1.24.9+k3s1, v1.23.15+k3s1). Please install the correct packages on your node before upgrading to or installing these releases if you want to use the ipsec backend.
Q: can you confirm that wireguard as the cable driver is not supposed to imply more modifications during the installation/runtime of Submariner than --cable-driver wireguard?
BTW, I can see that Submariner detects the CNI as generic and not flannel , do you have daemonset named flannel in kube-system NS ? do you have volume named flannel in this daemonset ?
Indeed you are right, the CNI is detected as generic (for both of the clusters). Example during the installation on domain-2:
$ subctl join broker-info.subm --clustercidr "10.42.0.0/16" --globalnet --cable-driver wireguard --clusterid domain-2 --context domain-2
✓ broker-info.subm indicates broker is at https://172.16.100.99:6443
✓ Discovering network details
Network plugin: generic # <--- HERE
Service CIDRs: [10.43.0.0/16]
Cluster CIDRs: []
There are 1 node(s) labeled as gateways:
- porthos
But in K3S there is no daemonset created for Flannel. Is Submariner supposed to base this detection on a daemonset?
Q: can you confirm that wireguard as the cable driver is not supposed to change/imply more modifications during the installation/runtime of Submariner than --cable-driver wireguard ?
To enable the WireGuard cable driver, you need to install WireGuard on the GW node (search for WireGuard here).
But in K3S there is no daemonset created for Flannel. Is Submariner supposed to base this detection on a deamonset ?
Yep, but we can work around pod CIDR discovery by specifying the --clustercidr flag in the join command (as you did).
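To check what the flannel detection asks about above (a daemonset named flannel in kube-system carrying a volume named flannel), something like this sketch should show whether either exists; on K3S, where Flannel is embedded in the k3s binary, it will not:

```shell
# List the volume names of a kube-system daemonset called "flannel";
# on K3S this daemonset does not exist, so the command errors out.
kubectl -n kube-system get daemonset flannel \
  -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\n"}{end}'
```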
Can you upload subctl gather and subctl diagnose all from both clusters ?
BTW, did you install clusters with overlapping CIDRs on purpose? If it isn't mandatory, maybe we can start with non-overlapping CIDRs and eliminate the Globalnet complexity.
Ok so I made some tests following your recommendations:
- pods in 10.44.0.0/16 and services in 10.45.0.0/16 for cluster domain-2 (--cluster-cidr=10.44.0.0/16 --service-cidr=10.45.0.0/16)
- pods in 10.46.0.0/16 and services in 10.47.0.0/16 for cluster domain-3 (--cluster-cidr=10.46.0.0/16 --service-cidr=10.47.0.0/16)
- I updated the join commands and removed every --globalnet option from them. Something like:

subctl deploy-broker --context e2e-mgmt
subctl join broker-info.subm --clustercidr "10.44.0.0/16" --servicecidr "10.45.0.0/16" --cable-driver wireguard --clusterid domain-2 --context domain-2
subctl join broker-info.subm --clustercidr "10.46.0.0/16" --servicecidr "10.47.0.0/16" --cable-driver wireguard --clusterid domain-3 --context domain-3
But again, the tests fail.
Here are the results of subctl gather and subctl diagnose all:
diagnose-202403121620.log submariner-gather-202403121620.zip
Sorry for the late answer,
I checked the logs and they look fine; the WireGuard connection is UP between the clusters. I assume it's some datapath issue with this combination of Flannel CNI, WireGuard, and K3S, which needs further investigation.
Since the WireGuard connection is up, I would expect at least the tests between pod@gw_node in domain2 and pod@gw_node in domain3 to pass.
What is the latest output of subctl verify --context <kubeContext1> --tocontext <kubeContext2> --only connectivity --verbose?
Can you also check subctl diagnose firewall intra-cluster --kubeconfig <cluster_kubeconfig>?
Hello @yboaron, yes I agree that the combination of the three is suspicious. I will execute the tests you ask for, and I was thinking about running the same tests with Cilium instead of Flannel, just in case.
Unfortunately, due to KubeCon EU, I will not be able to test these this week. I will keep you posted as soon as I can.
Thanks again !
Hello @yboaron! Sorry for the very long delay, I was busy trying new things after KubeCon!
Regarding this issue, I had to deploy the clusters from scratch using Cilium, but it is still failing. Sorry, I changed one or two things but nothing big:
- e2e-mgmt is the broker; domain-1 and domain-2 are the participating clusters
- --cluster-cidr=10.10.0.0/16 --service-cidr=10.11.0.0/16 for domain-1
- --cluster-cidr=10.12.0.0/16 --service-cidr=10.13.0.0/16 for domain-2

I deployed the Submariner broker as usual. Then for the participating clusters domain-1 and domain-2:
# labels and annotations
kubectl label node porthos "submariner.io/gateway=true" --context domain-1
kubectl label node harry "submariner.io/gateway=true" --context domain-2
Joining domain-1 and domain-2:
subctl join broker-info.subm --clustercidr "10.10.0.0/16" --servicecidr "10.11.0.0/16" --cable-driver wireguard --clusterid domain-1 --context domain-1
subctl join broker-info.subm --clustercidr "10.12.0.0/16" --servicecidr "10.13.0.0/16" --cable-driver wireguard --clusterid domain-2 --context domain-2
I deployed this in domain-2 in the namespace federation-1:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rebel-base
spec:
  selector:
    matchLabels:
      name: rebel-base
  replicas: 1
  template:
    metadata:
      labels:
        name: rebel-base
    spec:
      containers:
        - name: rebel-base
          image: docker.io/nginx:1.15.8
          ports:
            - containerPort: 80
              name: http
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html/
      volumes:
        - name: html
          configMap:
            name: rebel-base-response
            items:
              - key: message
                path: index.html
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rebel-base-response
data:
  message: "hello federation-1 from domain-2\n"
---
apiVersion: v1
kind: Service
metadata:
  name: rebel-base-svc
spec:
  type: ClusterIP
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: http
  selector:
    name: rebel-base
Now I export the service :
subctl export service rebel-base-svc -n federation-1 --context domain-2
kubectl get serviceimport -n federation-1 --context domain-1
NAME TYPE IP AGE
rebel-base-svc ClusterSetIP 96s
Local requests work (domain-2 to domain-2):
$ kubectl run x-wing --rm -it --image nicolaka/netshoot --context domain-2 -- \
    curl rebel-base-svc.federation-1.svc.clusterset.local.
hello federation-1 from domain-2
But requests from domain-1 continue to fail:
$ kubectl run x-wing --rm -it --image nicolaka/netshoot --context domain-1 -- \
curl rebel-base-svc.federation-1.svc.clusterset.local. --connect-timeout 3
curl: (28) Failed to connect to rebel-base-svc.federation-1.svc.clusterset.local. port 80 after 3002 ms: Timeout was reached
This time, I used Cilium Hubble to monitor the network. I found something strange:
- domain-1 reaches the service in domain-2
- domain-2 drops the reply intended for domain-1 (federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK))

Observing the pod x-wing in domain-1 during the curl request:
Apr 9 13:35:59.989: default/x-wing:56847 (ID:84471) -> kube-system/coredns-6799fbcd5-vgsnf:53 (ID:110854) to-endpoint FORWARDED (UDP)
Apr 9 13:35:59.991: default/x-wing:56847 (ID:84471) <- kube-system/coredns-6799fbcd5-vgsnf:53 (ID:110854) to-endpoint FORWARDED (UDP)
Apr 9 13:35:59.992: default/x-wing:48924 (ID:84471) -> 10.13.178.142:80 (world) to-stack FORWARDED (TCP Flags: SYN)
# ends here
10.13.178.142:80 is indeed the rebel-base-svc in domain-2:
kubectl get svc -n federation-1 --context domain-2 -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
rebel-base-svc ClusterIP 10.13.178.142 <none> 80/TCP 44m name=rebel-base
Observing the pod rebel-base in domain-2 during the curl request:
Apr 9 13:36:28.812: 10.10.2.95:40152 (world) -> federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-endpoint FORWARDED (TCP Flags: SYN)
Apr 9 13:36:28.812: 10.10.2.95:40152 (world) <- federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-stack FORWARDED (TCP Flags: SYN, ACK)
Apr 9 13:36:28.812: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:36:29.041: 10.10.2.95:40152 (world) <> federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-overlay FORWARDED (TCP Flags: SYN)
Apr 9 13:36:29.826: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:36:29.842: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:36:30.071: 10.10.2.95:40152 (world) <> federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-overlay FORWARDED (TCP Flags: SYN)
Apr 9 13:36:31.842: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:36:36.066: 10.10.2.95:40152 (world) <- federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-stack FORWARDED (TCP Flags: SYN, ACK)
Apr 9 13:36:36.067: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:36:44.259: 10.10.2.95:40152 (world) <- federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-stack FORWARDED (TCP Flags: SYN, ACK)
Apr 9 13:36:44.259: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
Apr 9 13:37:00.386: 10.10.2.95:40152 (world) <- federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) to-stack FORWARDED (TCP Flags: SYN, ACK)
Apr 9 13:37:00.386: federation-1/rebel-base-596bc7d8ff-45tpg:80 (ID:142026) <> 10.10.2.95:40152 (world) Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK)
You can notice the error messages: Stale or unroutable IP DROPPED (TCP Flags: SYN, ACK).
10.10.2.95:40152 is indeed the x-wing pod in domain-1:
kubectl get pod -n default --context domain-1 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
x-wing 1/1 Running 0 46m 10.10.2.95 athos <none> <none>
I have no idea why the packets are not correctly routed on either side.
Hi @IceManGreen ,
A. We have the following configuration: K3S, Cilium as the CNI, and wireguard as the cable driver.
B. You can resolve the domain-2/rebel-base-svc IP from domain-1, which means that the Submariner inter-cluster WireGuard tunnels are up and multi-cluster service discovery looks fine.
C. So, we need to understand why the SYN/ACK packet is being dropped. Submariner implements the egress datapath between clusters, lets the CNI handle ingress, and sets rp_filter to loose mode on the CNI network interfaces to allow asymmetric traffic; check this for more details.
It could be that Submariner failed to detect the CNI network interface and update rp_filter (Submariner looks for a network interface with an IP address from the clustercidr range) and packets are dropped by the kernel, or that some firewall/infra SG blocks inter-cluster traffic.
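A quick way to check the rp_filter theory on a node could look like this sketch, assuming the CNI interface is cilium_host as reported in this thread, and that 2 (loose mode) is the value Submariner is expected to set:

```shell
# Show reverse-path filtering on the CNI interface; Submariner's route
# agent should have set this to 2 (loose mode) to allow asymmetric traffic.
sysctl net.ipv4.conf.cilium_host.rp_filter
# If it is still 1 (strict), loosening it manually helps confirm the theory:
# sysctl -w net.ipv4.conf.cilium_host.rp_filter=2
```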
Hi @yboaron
Thanks for the summary !
Here is the result of subctl verify and subctl gather for (and between) domain-1 and domain-2:
According to your words about the CNI detection by Submariner, I ran the following test :
$ subctl diagnose cni --context domain-1
⚠ Checking Submariner support for the CNI network plugin
⚠ Submariner could not detect the CNI network plugin and is using ("generic") plugin. It may or may not work.
Indeed, Submariner does not detect Cilium as the cluster CNI. Same thing in domain-2.
Do I have a way to force Submariner to consider the cluster's CNI as Cilium ? I do not know if such a question makes sense.
Hi @IceManGreen , Sorry for the late response,
In your case Submariner failed to detect Cilium and used the "generic" CNI handling; Submariner treats a generic CNI as a kube-proxy/iptables-based CNI.
I checked the Submariner logs from both clusters and found no errors. The Submariner route-agent detects the 'cilium_host' interface as the CNI interface (the interface with an IP address from the clusterCIDR) and updates its rp_filter to '2'.
From the subctl verify output, it appears that all the connectivity tests in which either the client or the server pod runs on a non-GW node are failing, so further datapath investigation is needed here.
I think I understand why the TCP SYN/ACK packet is wrongly handled by Cilium (it should be handled by Submariner). The x-wing pod IP is 10.10.2.95, and checking the domain-2/hermione node ip-routing table I can see:
10.10.0.0/16 via 240.19.112.84 dev vx-submariner proto static
10.10.1.0/24 via 10.12.1.86 dev cilium_host proto kernel src 10.12.1.86 mtu 1370
10.10.2.0/24 via 10.12.1.86 dev cilium_host proto kernel src 10.12.1.86 mtu 1370
So, the packet should be routed by Submariner according to
10.10.0.0/16 via 240.19.112.84 dev vx-submariner proto static
but some component added routes that fall inside the remote cluster CIDR, like:
10.10.2.0/24 via 10.12.1.86 dev cilium_host proto kernel src 10.12.1.86 mtu 1370
In this case the kernel (longest prefix match) will wrongly route the packet using 10.10.2.0/24 via 10.12.1.86 dev cilium_host proto kernel src 10.12.1.86 mtu 1370 and not via vx-submariner.
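The longest-prefix-match behaviour described above can be reproduced with a small self-contained sketch; the destination IP and the two prefixes are taken from the route table quoted in this thread:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# in_subnet IP NETWORK PREFIXLEN -> exit status 0 if IP is in NETWORK/PREFIXLEN
in_subnet() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Both routes match the x-wing pod IP; the kernel always picks the most
# specific prefix, so traffic goes via cilium_host instead of vx-submariner.
dst=10.10.2.95
in_subnet "$dst" 10.10.0.0 16 && echo "matches 10.10.0.0/16 -> vx-submariner"
in_subnet "$dst" 10.10.2.0 24 && echo "matches 10.10.2.0/24 -> cilium_host (more specific, wins)"
```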
I can see similar routes also on the domain-1 cluster, for example on the domain-1/aramis node:
$ ip route show
default via 172.16.110.1 dev enp2s0
10.10.0.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46
10.10.0.46 dev cilium_host proto kernel scope link
10.10.1.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1450
10.10.2.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1450
10.12.0.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
10.12.0.0/16 via 240.19.112.81 dev vx-submariner proto static
10.12.1.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
10.12.2.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
10.13.0.0/16 via 240.19.112.81 dev vx-submariner proto static
10.14.0.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
10.14.1.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
10.14.2.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370
172.16.100.0/22 dev enp1s0 proto kernel scope link src 172.16.100.82
172.16.110.0/22 dev enp2s0 proto kernel scope link src 172.16.110.82
240.0.0.0/8 dev vx-submariner proto kernel scope link src 240.19.116.82
You should first address this routing issue.
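As a quick sanity check, one can scan a node's routing table for entries that fall inside a remote cluster CIDR but do not egress via `vx-submariner`. This is a hypothetical helper, not a Submariner tool, using two of the lines from the domain-1/aramis table above:

```python
import ipaddress

# Remote cluster CIDRs as seen from domain-1 (per the discussion above).
REMOTE_CIDRS = [ipaddress.ip_network(c) for c in ("10.12.0.0/16", "10.13.0.0/16")]

def conflicting_routes(route_lines):
    """Yield `ip route` lines whose prefix lies inside a remote cluster
    CIDR but whose egress device is not vx-submariner."""
    for line in route_lines:
        fields = line.split()
        if not fields or "/" not in fields[0]:
            continue  # skip default and host routes
        prefix = ipaddress.ip_network(fields[0])
        dev = fields[fields.index("dev") + 1] if "dev" in fields else ""
        for cidr in REMOTE_CIDRS:
            if prefix.subnet_of(cidr) and dev != "vx-submariner":
                yield line  # this route will hijack cross-cluster traffic
                break

# Sample lines taken from the routing table above.
table = [
    "10.12.0.0/24 via 10.10.0.46 dev cilium_host proto kernel src 10.10.0.46 mtu 1370",
    "10.12.0.0/16 via 240.19.112.81 dev vx-submariner proto static",
]
print(list(conflicting_routes(table)))  # flags only the cilium_host /24
```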
Hi @IceManGreen, any update? Can we close this issue?
Hello @yboaron, sorry for the very late answer too!
Are `10.10.0.0/16` and `10.11.0.0/16` for domain-1, and `10.12.0.0/16` and `10.13.0.0/16` for domain-2, the cluster CIDRs?
Yes, they are!
Regarding the rest of your answer, thank you so much for guiding me through so many details, it helps a lot. Unfortunately, I did not succeed in managing the routes and making things work in the end. Since Cilium adds its own routing logic that conflicts with Submariner, the solution becomes very hard to manage, and I was planning to scale things up in terms of machines and clusters... So I will probably look for another solution, like Calico + Submariner if they fit together. This is what I recommend for anyone who faces the same kind of trouble with Cilium and Submariner.
Thanks again for your help! I think we are good with this issue!
@IceManGreen, I'm going to close this topic; feel free to reopen it if it's still relevant.
@yboaron @IceManGreen I have encountered the same problem in a Flannel environment. The root cause is that the MAC address of the vx-submariner interface created by the route agent is the same on every node. As a result, services in another cluster can only be reached from the gateway node, because traffic from a non-gateway node to the gateway node (through the vx-submariner interface) does not get delivered. So I will read the route-agent code that creates the vx-submariner interface.
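To confirm the duplicate-MAC theory, it is enough to collect the `link/ether` line of `ip link show vx-submariner` from each node and compare them. A small sketch of the comparison, with made-up node names and MACs for illustration:

```python
from collections import defaultdict

def duplicate_macs(node_macs):
    """Group nodes by the MAC of their vx-submariner interface and
    return only the MACs shared by more than one node."""
    by_mac = defaultdict(list)
    for node, mac in node_macs.items():
        by_mac[mac.lower()].append(node)
    return {mac: nodes for mac, nodes in by_mac.items() if len(nodes) > 1}

# Hypothetical values gathered from `ip link show vx-submariner` per node.
sample = {
    "worker-1": "f2:2c:19:3f:01:aa",
    "worker-2": "f2:2c:19:3f:01:aa",  # same MAC as worker-1 -> suspicious
    "gateway-1": "f2:2c:19:3f:01:bb",
}
print(duplicate_macs(sample))
```

If the result is non-empty, VXLAN frames between those nodes can be misdelivered, which matches the non-gateway-to-gateway failure described above.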
@yboaron I think we need to reopen this issue.
@huangjiasingle OK, let's reopen it.
Could you please specify what issue you encountered? Also, please elaborate on your environment (CNI, platform, Submariner version, cable driver, etc.).
@yboaron my env:
Btw, I read the route-agent code that creates the vx-submariner interface. It doesn't define the MAC address when creating the interface, so I want to know what sets the MAC address of vx-submariner.
Hi @huangjiasingle , thanks for reaching out.
This looks like a data-path issue that needs further investigation. Please attach the output of `subctl diagnose all --kubeconfig`.
Hello everyone,
I have an issue with a Submariner use case where I want to separate the control-plane network (used by the broker to communicate with the participating clusters) from the data-plane network (used by participating clusters to connect my applications through the gateways).
I have deployed 3 clusters:

- `e2e-mgmt`
- `domain-2`
- `domain-3`
Note that each deployment cluster has 3 nodes, and each node has 2 interfaces:
The nodes are actually virtual machines. Some tests showed that each VM can reach the others through the control plane or the data plane (ex: from `172.16.100.10` to `172.16.100.11`, but not from `172.16.100.10` to `172.16.110.11`).

I labeled and annotated all my nodes in `domain-2` and `domain-3` like:

Because I want the VPN tunnels to communicate through `172.16.110.0/24` (data plane) even if the Kubernetes APIs are listening on `172.16.100.0/24` (control plane).

I created the broker and joined the clusters using:
Indeed, the clusters joined the broker:

The connections seem good:

Or maybe the remote IPs should be on `172.16.110.0/24`?

The gateways seem good:
I created an Nginx service in `domain-2`, in namespace `hello-domain-2`, called `hello-world-svc`.

Using netshoot, I can tell that the requests work locally from `domain-2`.
I exported the service, and `domain-3` properly consumed the `ServiceImport` from the Broker:

However, the same test with netshoot from `domain-3` does not work:

Even if `dig` resolves it properly:

What did I do wrong?
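For reference, Lighthouse resolves exported services under the `clusterset.local` zone defined by the Kubernetes Multi-Cluster Services API; a small sketch of how the name the netshoot pod should query is derived from the service and namespace used in this thread:

```python
def clusterset_dns_name(service, namespace, cluster=None):
    """Build the MCS DNS name Lighthouse serves for an exported service.
    With `cluster` set, the name targets that specific cluster's endpoints."""
    base = f"{service}.{namespace}.svc.clusterset.local"
    return f"{cluster}.{base}" if cluster else base

# Names from the thread: service hello-world-svc in namespace hello-domain-2.
print(clusterset_dns_name("hello-world-svc", "hello-domain-2"))
# -> hello-world-svc.hello-domain-2.svc.clusterset.local
```

Since `dig` already resolves this name, DNS is fine and the failure is purely in the data path, consistent with the routing issues discussed above.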