submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0
2.4k stars, 188 forks

Cannot access pods that are not on the GW node. #3140

Open YHJ94 opened 2 weeks ago

YHJ94 commented 2 weeks ago

Background

What happened

I successfully set up Submariner between two clusters, each with a single worker node. Everything works fine when the pods are on the same node as the gateway.

The problem started after I scaled out my worker nodes.

Here are some examples of what I'm facing.

  1. pod-A@gw-node@cluster1 -> pod-b@none-gw-node@cluster2 via service domain: works fine.
  2. pod-A@gw-node@cluster1 -> pod-b@none-gw-node@cluster2 via pod IP: Connection timeout.
  3. pod-A@none-gw-node@cluster1 -> pod-b@gw-node&none-gw-node@cluster2 via service domain & pod IP: Connection timeout.

Oddly, I can reach none-gw-pods only when I call the service domain (service discovery) from gw-pods.

What you expected to happen

Connections should succeed between all pods, regardless of whether I use a service domain or pod IP.

Anything else we need to know?

I have troubleshot this as much as I could.

I suspect that traffic between the none-gw-pod and the actual gateway is not working properly.

Environment

yboaron commented 2 weeks ago

Thanks for reaching out, @YHJ94.

I checked the logs and found no errors; it looks like a data path issue that needs further investigation.

It seems pod-A@gw-node@cluster1 -> vxlan-tunnel -> cluster2 is OK. But after that, I have no idea what went wrong. Also, pod-A@none-gw-node@cluster1 -> gw-node -> vxlan-tunnel is NOT OK: no packets can be detected passing through vxlan-tunnel.

We seem to have two different segments to troubleshoot:

  1. Egress from pod-A@none-gw-node@cluster1 -> gw-node -> vxlan-tunnel
  2. Ingress from gw_node@clusterX to pod@non_gw_node@clusterX

For '1', the packet is:

  a. VXLAN-encapsulated via the vx-submariner interface (UDP port 4800) to reach the GW node
  b. VXLAN-decapsulated via vx-submariner
  c. VXLAN-encapsulated via the vxlan-tunnel interface (UDP port 4500) to reach the GW node in the remote cluster

Running tcpdump on vx-submariner and vxlan-tunnel can help us understand the root cause here.

For '2', the packet:

  A. is VXLAN-decapsulated via vxlan-tunnel (UDP port 4500)
  B. should then be forwarded by Calico to pod@non_gw_node

Here too, running tcpdump on the GW node can point us to the root cause.
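A rough sketch of those captures, assuming tcpdump is installed on the nodes. The interface names and UDP ports are the ones mentioned above; POD_IP is only a placeholder (not an address from this thread), and the small capture helper is purely illustrative. It prints the commands by default so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Dry run by default: prints each tcpdump command.
# Set RUN=1 (as root on the node) to execute them instead.
# POD_IP is a placeholder for the destination pod IP; replace it with a real address.
POD_IP=${POD_IP:-10.0.0.10}

capture() {
  # $1 = interface, remaining args = pcap filter expression
  iface=$1; shift
  if [ "${RUN:-0}" = 1 ]; then
    # Runs until interrupted with Ctrl-C
    tcpdump -nn -i "$iface" "$@"
  else
    echo "tcpdump -nn -i $iface $*"
  fi
}

# Segment 1: outer intra-cluster VXLAN packets (UDP 4800), on the non-GW node and the GW node
capture any udp port 4800
# Segment 1: inner packets on the Submariner intra-cluster interface
capture vx-submariner host "$POD_IP"
# GW node: outer inter-cluster tunnel packets (UDP 4500)
capture any udp port 4500
# Segment 2: inner packets on the inter-cluster tunnel interface of the GW node
capture vxlan-tunnel host "$POD_IP"
```

With RUN=1 each capture blocks until it is interrupted, so in practice it is easier to run the printed commands one per terminal while repeating the failing request from pod-A.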

Do you have any security groups in your infra/OpenStack cloud that might block inter-cluster traffic or Submariner intra-cluster traffic (port 4800)? Do you have any network policies defined in your clusters?

YHJ94 commented 2 weeks ago

Thanks for your support, @yboaron.

First, there are no network policies in my cluster, and my security group already has bidirectional rules for UDP 4500 and 4800. (Actually, I allowed all TCP and UDP traffic between the two clusters.)

I also captured as much as I could with tcpdump; here's what I got.

Test Environment

Scenario #1

Scenario #2

yboaron commented 1 week ago

Sorry for the late response,

Can you please try setting the cable driver to libreswan and see if that helps?

You can add the flags below [1] to the subctl join command to set the cable driver to libreswan.

[1] --cable-driver libreswan --force-udp-encaps
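For illustration, a full join invocation with those flags might look like the sketch below. The broker-info.subm path and the cluster1 ID are assumptions, not values from this thread; substitute whatever was used for the original join. The script only prints the command unless the last line is uncommented.

```shell
#!/bin/sh
# Assumed values, not taken from this thread: adjust to your own setup.
BROKER_INFO=${BROKER_INFO:-broker-info.subm}
CLUSTER_ID=${CLUSTER_ID:-cluster1}

# Build the join command with the two flags from [1]
set -- subctl join "$BROKER_INFO" --clusterid "$CLUSTER_ID" \
  --cable-driver libreswan --force-udp-encaps

# Print the command for review first
echo "$*"
# "$@"   # uncomment to actually run the join
```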

yboaron commented 10 hours ago

@YHJ94, any update on this issue?