Thanks for reaching out, @YHJ94.
I checked the logs and found no errors; it looks like a data-path issue that needs further investigation.

It seems pod-A@gw-node@cluster1 -> vxlan-tunnel -> cluster2 is OK, but after that, I have no idea what went wrong. Also, pod-A@non-gw-node@cluster1 -> gw-node -> vxlan-tunnel is NOT OK: I cannot detect any packets passing through vxlan-tunnel.
We seem to have two different segments to troubleshoot.

For '1', the packet:
a. is VXLAN-encapsulated via the vx-submariner interface (UDP port 4800) to reach the GW node
b. is VXLAN-decapsulated via vx-submariner
c. is VXLAN-encapsulated via the vxlan-tunnel interface (UDP port 4500) to reach the GW node in the remote cluster

Tcpdumping vx-submariner and vxlan-tunnel can help us understand the root cause here.
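Something along these lines on the cluster1 GW node should cover both hops (a sketch, not from the original thread; vx-submariner and vxlan-tunnel are the default Submariner interface names, and the port-9090 filter assumes the service port used in these tests):

```bash
# Hops a/b: traffic from the pod on the non-GW node arriving over vx-submariner (UDP 4800 outer)
sudo tcpdump -ni vx-submariner port 9090

# Hop c: traffic leaving towards the remote GW node over vxlan-tunnel (UDP 4500 outer)
sudo tcpdump -ni vxlan-tunnel port 9090
```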
For '2', the packet:
A. is VXLAN-decapsulated via vxlan-tunnel (UDP port 4500)
B. should be forwarded by Calico to pod@non_gw_node

Here too, tcpdumping the traffic on the GW node can point us to the root cause.
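For example, on the cluster2 GW node (again a sketch; UDP 4789 is Calico's default VXLAN port):

```bash
# Step A: inner traffic arriving over the inter-cluster tunnel
sudo tcpdump -ni vxlan-tunnel port 9090

# Step B: the Calico VXLAN hop towards the pod on the non-GW node
sudo tcpdump -ni any udp port 4789
```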
Do you have any security groups in your infra/OpenStack cloud that might block inter-cluster traffic or Submariner intra-cluster traffic (port 4800)? Do you have network policies defined in your clusters?
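For instance, checks along these lines (a sketch; 'my-secgroup' is a placeholder for your actual security group name):

```bash
# Look for NetworkPolicies in both clusters
kubectl get networkpolicies -A --context cluster1
kubectl get networkpolicies -A --context cluster2

# List the OpenStack security group rules (replace 'my-secgroup' with yours)
openstack security group rule list my-secgroup
```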
Thanks for your support, @yboaron.
First, there are no network policies in my clusters, and my security group already has bidirectional rules for UDP 4500 and 4800. (Actually, I allowed all TCP and UDP traffic between the two clusters.)
And I tcpdumped as much as I could; here's what I got.
cluster1
```
$ kc get nodes -o wide --context cluster1
NAME                             STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP
cluster1-default-worker-node-0   Ready    <none>   4d16h   v1.29.3   192.168.0.6    <none>
cluster1-default-worker-node-1   Ready    <none>   3d15h   v1.29.3   192.168.0.22   <none>
```
```
$ kc get pods -o wide --context cluster1
NAME        READY   STATUS    RESTARTS   AGE     IP              NODE
curl-pod    1/1     Running   0          3d22h   10.100.111.12   cluster1-default-worker-node-0   # simple busybox pod used to run 'curl'
curl-pod2   1/1     Running   0          3d12h   10.100.79.16    cluster1-default-worker-node-1
```
cluster2
```
$ kc get nodes -o wide --context cluster2
NAME                             STATUS   ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP
cluster2-default-worker-node-0   Ready    <none>   4d16h   v1.29.3   192.168.0.9     <none>
cluster2-default-worker-node-1   Ready    <none>   3d15h   v1.29.3   192.168.0.106   <none>
```
```
$ kc get svc,pod -o wide --context cluster2
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)
service/hello-world   ClusterIP   10.255.210.51   <none>        9090/TCP   # service port 9090 -> container port 8080

NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE
pod/hello-world-66c79b8cf7-dpgcp   1/1     Running   0          99m   10.110.216.74   cluster2-default-worker-node-1   # simple 'hello world' pod that responds with some text
pod/hello-world-66c79b8cf7-lsxbq   1/1     Running   0          99m   10.110.33.72    cluster2-default-worker-node-0
```
curl-pod2@non-gw-node@cluster1 --> hello-world-service(10.255.210.51:9090)
Results: NOT OK

tcpdump

> a. is VXLAN-encapsulated via the vx-submariner interface (UDP port 4800) to reach the GW node

```
$ sudo tcpdump -i vx-submariner port 9090
09:52:55.884832 IP cluster1-default-worker-node-1.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673371372 ecr 0,nop,wscale 7], length 0
09:52:56.907027 IP cluster1-default-worker-node-1.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673372395 ecr 0,nop,wscale 7], length 0
09:52:58.922983 IP cluster1-default-worker-node-1.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673374411 ecr 0,nop,wscale 7], length 0

09:52:55.886489 IP 240.168.0.22.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673371372 ecr 0,nop,wscale 7], length 0
09:52:56.908019 IP 240.168.0.22.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673372395 ecr 0,nop,wscale 7], length 0
09:52:58.924002 IP 240.168.0.22.60587 > 10.255.210.51.9090: Flags [S], seq 2522249798, win 65280, options [mss 1360,sackOK,TS val 673374411 ecr 0,nop,wscale 7], length 0
```
```
$ sudo tcpdump -i any udp port 4800
10:15:00.544936 eth0 Out IP cluster1-default-worker-node-1.41404 > 192.168.0.6.4800: UDP, length 82
10:15:00.547087 eth0 In  IP 192.168.0.22.41404 > cluster1-default-worker-node-0.4800: UDP, length 82
```
- I assume this step is working properly.
> b. is VXLAN-decapsulated via vx-submariner
> c. is VXLAN-encapsulated via the vxlan-tunnel interface (UDP port 4500) to reach the GW node in the remote cluster
```
$ sudo tcpdump -i vxlan-tunnel port 9090
0 packets captured
0 packets received by filter
0 packets dropped by kernel
```

```
$ sudo tcpdump -i any udp port 4500
10:19:41.560211 eth0 In  IP 192.168.0.9.33681 > cluster1-default-worker-node-0.ipsec-nat-t: UDP-encap: ESP(spi=0x08000000,seq=0x3e800), length 74
10:19:41.560331 eth0 Out IP cluster1-default-worker-node-0.49365 > 192.168.0.9.ipsec-nat-t: UDP-encap: ESP(spi=0x08000000,seq=0x3e800), length 74
...
```
- There are no actual packets for the requested port.
- When I dump UDP 4500, I get a bunch of these 'ipsec-nat-t' encapsulated packets, but I don't think they mean anything.
- **No packets pass through vxlan-tunnel. Thus, there is no ingress on cluster2.**
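To narrow this down further, it might help to confirm on the cluster1 GW node that the vxlan-tunnel device exists and that routes to the remote cluster's CIDRs point at it (generic iproute2 checks, not Submariner-specific tooling):

```bash
# Inspect the Submariner inter-cluster VXLAN device
ip -d link show vxlan-tunnel

# Look for routes over it in any routing table (Submariner may use a dedicated table)
ip route show table all | grep vxlan-tunnel
```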
curl-pod@gw-node@cluster1 --> hello-world-service(10.255.210.51:9090)
Results: OK
tcpdump
> A. is VXLAN-decapsulated via vxlan-tunnel (UDP port 4500)
```
$ sudo tcpdump -i vxlan-tunnel port 9090
10:32:16.666199 IP cluster1-default-worker-node-0.54890 > 10.255.210.51.9090: Flags [S], seq 4098193096, win 65280, options [mss 1360,sackOK,TS val 1634855176 ecr 0,nop,wscale 7], length 0
10:32:16.666961 IP 10.255.210.51.9090 > cluster1-default-worker-node-0.54890: Flags [S.], seq 1818495669, ack 4098193097, win 64704, options [mss 1360,sackOK,TS val 3091215469 ecr 1634855176,nop,wscale 7], length 0

10:32:16.666243 IP 241.168.0.6.54890 > 10.255.210.51.9090: Flags [S], seq 4098193096, win 65280, options [mss 1360,sackOK,TS val 1634855176 ecr 0,nop,wscale 7], length 0
10:32:16.666551 IP 10.255.210.51.9090 > 241.168.0.6.54890: Flags [S.], seq 1818495669, ack 4098193097, win 64704, options [mss 1360,sackOK,TS val 3091215469 ecr 1634855176,nop,wscale 7], length 0
```
```
$ sudo tcpdump -i any udp port 4500
10:37:16.915030 eth0 In  IP 192.168.0.9.41008 > cluster1-default-worker-node-0.ipsec-nat-t: UDP-encap: ESP(spi=0x08000000,seq=0x3e800), length 259
10:37:16.914657 eth0 Out IP cluster2-default-worker-node-0.41008 > 192.168.0.6.ipsec-nat-t: UDP-encap: ESP(spi=0x08000000,seq=0x3e800), length 259
```
- The encapsulated packets contain the requested data payload, so I think this step works fine.
> B. should be forwarded by Calico to pod@non_gw_node
```
$ sudo tcpdump -i any host 192.168.0.106
10:44:26.606851 eth0 Out IP cluster2-default-worker-node-0.37899 > 192.168.0.106.4789: VXLAN, flags [I] (0x08), vni 4096
IP cluster2-default-worker-node-0.65474 > 10.110.216.74.http-alt: Flags [P.], seq 1:83, ack 1, win 510, options [nop,nop,TS val 1635585116 ecr 769177139], length 82: HTTP: GET / HTTP/1.1
```
```
$ sudo tcpdump -i any host 192.168.0.9
10:44:26.606735 eth0 In IP 192.168.0.9.37899 > cluster2-default-worker-node-1.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.110.33.0.65474 > 10.110.216.74.http-alt: Flags [.], ack 1, win 510, options [nop,nop,TS val 1635585116 ecr 769177139], length 0
```
- Packet forwarding via Calico VXLAN is working properly.
Sorry for the late response.
Can you please try setting the cable driver to Libreswan and see if that helps?
You can add the flags below [1] to the subctl join command to set the cable driver to Libreswan.

[1] --cable-driver libreswan --force-udp-encaps
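A minimal sketch of the full join command, assuming the default broker-info.subm file and the cluster ID from the original deployment:

```bash
subctl join broker-info.subm --clusterid cluster1 \
  --cable-driver libreswan --force-udp-encaps
```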
@YHJ94, any update on this issue?
@yboaron, so sorry. I was not able to test it due to my current situation. I will try it out and let you know as soon as possible. Thanks.
@YHJ94 - closing this issue for now. Feel free to reopen if you still need any help.
Background
I have two clusters:
- CNI: Calico VXLAN
- Submariner cable driver: VXLAN
What happened
I successfully set up Submariner between two clusters, each with a single worker node. It works fine when pods are located on the same node as the gateway.
The problem occurred after I scaled out my worker nodes.
Here are some examples of what I'm facing.
It's very weird that I can access non-gw pods only when calling the service domain (service discovery) from gw pods.
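For illustration, the kind of check this involves (a sketch; it assumes the hello-world service was exported with `subctl export service` in the default namespace, so it resolves under the usual clusterset domain):

```bash
# From curl-pod (on the cluster1 gateway node) via service discovery -- this works
kubectl --context cluster1 exec curl-pod -- \
  curl -s http://hello-world.default.svc.clusterset.local:9090

# The same request from curl-pod2 (on the non-gateway node) -- this fails
kubectl --context cluster1 exec curl-pod2 -- \
  curl -s http://hello-world.default.svc.clusterset.local:9090
```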
What you expected to happen
Connections should succeed between all pods, regardless of whether I use a service domain or pod IP.
Anything else we need to know?
I have troubleshot as much as I could.
So I suspect that the traffic between the non-gw pod and the actual gateway is not working well.
Environment
- Diagnose information (use `subctl diagnose all`):
- Gather information (use `subctl gather`): cluster1.zip cluster2.zip
- Firewall check:

```
subctl verify --context cluster1 --tocontext cluster2 --only service-discovery,connectivity --verbose
```
- Cloud provider or hardware configuration:
- Install tools:
- Others: