submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

ICMP `x.x.x.x` unreachable - need to frag (mtu 1450), length 556 #2887

Closed cangyin closed 9 months ago

cangyin commented 9 months ago

What happened: Non-gateway nodes cannot ping gateway nodes via the vx-submariner interface. tcpdump prints `need to frag (mtu 1450)`, but the MTU reported by the ping command is 1400.

I set up 2 clusters, each with 3 hosts, and installed Submariner with the libreswan cable driver. The broker is deployed in the first cluster.

Submariner in the second cluster functions normally, and I can ping gateway nodes from non-gateway nodes via the vx-submariner interface, i.e. ping 240.x.x.x.

In the first cluster it does not work. tcpdump shows ICMP errors complaining about the MTU size:

# On gateway node, run:
$ tcpdump -i ens18 -nn host 10.74.124.51 and not tcp
10:18:55.489550 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:56.289779 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:56.689642 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:57.289630 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:58.289783 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:58.689749 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:18:59.690185 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:19:00.089841 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556
10:19:01.089884 IP 10.74.124.53 > 10.74.124.51: ICMP 10.74.124.53 unreachable - need to frag (mtu 1450), length 556

When I probe the MTU with the ping command on a non-gateway node:

# ping -M do -s 1420  240.74.124.52
PING 240.74.124.52 (240.74.124.52) 1420(1448) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400

It reports an MTU of 1400?

In the second cluster the reported MTU is 1450, which is correct:

# ping -M do -s 1430  240.74.124.55
PING 240.74.124.55 (240.74.124.55) 1430(1458) bytes of data.
ping: local error: message too long, mtu=1450
ping: local error: message too long, mtu=1450
ping: local error: message too long, mtu=1450
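
The sizes in the two probes above follow from header overhead: the `-s` value is the ICMP payload only, and the `1420(1448)` in ping's output adds 8 bytes of ICMP header and 20 bytes of IPv4 header. A small sketch of that arithmetic, assuming IPv4 without IP options; the function names are made up for illustration:

```shell
#!/bin/sh
# IPv4 header (no options, 20 bytes) + ICMP echo header (8 bytes).
IP_ICMP_OVERHEAD=28

# Largest `ping -s` payload that fits in one unfragmented packet of the given MTU.
max_ping_payload() {
    echo $(( $1 - IP_ICMP_OVERHEAD ))
}

# IP packet size produced by `ping -s <payload>`.
ping_packet_size() {
    echo $(( $1 + IP_ICMP_OVERHEAD ))
}

ping_packet_size 1420   # prints 1448: fits in 1450, but not in 1400
max_ping_payload 1450   # prints 1422
max_ping_payload 1400   # prints 1372
```

So `-s 1420` would be expected to pass on a path whose MTU really is 1450, and `-s 1430` (1458 bytes) is correctly rejected in the second cluster.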

These are the MTU sizes for all interfaces in the first cluster.

On a non-gateway node:

# ip a s | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
55: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
81: calie483855c297@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
82: cali39efe866779@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
(...)
139: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
140: vx-submariner: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default

On the gateway node:

# ip a s | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
48: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
60: cali0a5ef616453@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
(...)
75: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noqueue state DOWN group default qlen 1000
89: cali1fc473fc4a7@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
(...)
131: vx-submariner: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
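
For comparing hosts, the listings above can be reduced to name and MTU pairs. A minimal sketch, assuming iproute2 is available; `iface_mtus` is a made-up helper name, and it works on either `ip -o link show` output or the `ip a s | grep mtu` lines shown here:

```shell
#!/bin/sh
# Read interface lines on stdin and print "<interface> <mtu>" for each one.
iface_mtus() {
    awk '{
        name = $2
        sub(/:$/, "", name)          # drop the trailing colon after the name
        for (i = 3; i < NF; i++)
            if ($i == "mtu") { print name, $(i + 1); break }
    }'
}

# Usage:
#   ip -o link show | iface_mtus
```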

The `need to frag` error also occurs in the second cluster, but less frequently.

What you expected to happen: No `need to frag` errors in tcpdump.

How to reproduce it (as minimally and precisely as possible): I don't know how to reproduce it. Any help diagnosing the problem would be really appreciated; I have been struggling with it for about a month.

Anything else we need to know?:

Environment:

version: 0.16.2

yboaron commented 9 months ago

Hi @cangyin , thanks for reaching out.

A. Regarding [1], it seems that ping resolved the outgoing network interface for destination IP 240.74.124.52 to an interface with an MTU of 1400. Can you share the routing table?

You can also try adding `-I <interface>` to the ping command, with the relevant interface.

B. Can you share the output of `subctl verify --only connectivity --context <kubeContext1> --tocontext <kubeContext2>` and `subctl verify --only connectivity --packet-size 500 --context <kubeContext1> --tocontext <kubeContext2>`?

[1]
ping -M do -s 1420 240.74.124.52
PING 240.74.124.52 (240.74.124.52) 1420(1448) bytes of data.
ping: local error: message too long, mtu=1400
ping: local error: message too long, mtu=1400
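
Point A can also be checked for a single destination: `ip route get <address>` asks the kernel which route and interface it would use. A sketch, assuming iproute2 is installed; `route_dev` is a made-up helper that extracts the interface name from that output, and the addresses are the ones from the report:

```shell
#!/bin/sh
# Read `ip route get <address>` output on stdin and print the outgoing interface.
route_dev() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Usage:
#   ip route get 240.74.124.52 | route_dev
#   ping -M do -s 1420 -I vx-submariner 240.74.124.52
```

If a path MTU has been cached for the destination, the full `ip route get` output typically shows it as well (an `mtu` value on the cache line), which may be where the 1400 reported by ping comes from.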

cangyin commented 9 months ago

I changed the VM's NIC from Realtek to Intel, and it works now. The reported MTU is now 1450 on every host.

cangyin commented 9 months ago

@yboaron Thanks!