projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

MTU issue with IPIP and host-networking #1709

Closed: squeed closed this issue 6 months ago

squeed commented 6 years ago

The problem: When using Calico on Kubernetes alongside some host-networking pods, the Linux per-destination MTU cache (the route exception cache) can make pods unreachable.

This is due to IP masquerading of traffic to destinations outside the ClusterCIDR, combined with services running in host networking.

Setup Details:

Consider a Kubernetes cluster running Calico. The Calico daemon runs on every node and configures the IPIP tunnel device (tunl0) with an IP from that node's PodCIDR. PodCIDRs are allocated from the ClusterCIDR of 10.2.0.0/16.

Because Calico uses ip-in-ip encapsulation, all of the pods (and the tunl0 interface) have an MTU of 1480.

The problem:

  1. Pod B1 opens a connection to pod A1 (on host networking). A TCP SYN is sent to 10.1.1.50, the HostIP of host A, with an MSS of 1460 (the pod's eth0 MTU of 1480 less TCP overhead).
  2. Host B masquerades the source IP, using the address of its outgoing interface, 10.1.1.51.
  3. Host A sees a SYN from 10.1.1.51 with a TCP MSS of 1460. It stores 1480 in its route cache's MTU field for 10.1.1.51. The connection proceeds normally, and is closed.
  4. Pod B1 opens a connection to pod A2. A SYN is sent to 10.2.0.2, and the connection is established over the IPIP tunnel.
  5. A2 tries to send a large response, and it is broken into 1480-byte packets. The DF bit is set, since this is TCP. The packets leave the pod and go to the host.
  6. Host A tries to encapsulate the packets, adding 20 bytes of overhead.
  7. The packet, now 1500 bytes, is too large for the cached MTU of its destination IP, 10.1.1.51, and is dropped. Linux does not generate an ICMP "fragmentation needed" message back to the sender.

In other words, packets over 1460 bytes are silently dropped for all pod traffic between hosts A and B.
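A minimal sketch of how to observe this from host A, using the example addresses above (the pod-on-host-B address is a placeholder):

# On host A: check whether a stale MTU exception has been learned for host B's address.
ip route get 10.1.1.51           # look for "cache ... mtu 1480" in the output
# From a pod on host A, probe a pod on host B with the DF bit set: 1452 bytes of ICMP
# payload plus 28 bytes of headers makes a 1480-byte packet, which no longer fits once
# IPIP encapsulation adds its 20-byte outer header. The ping simply times out.
ping -M do -s 1452 <pod-on-host-B>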

fxpester commented 6 years ago

I solved this by lowering Calico's IPIP MTU: calicoctl config set --raw=felix IpInIpMtu 1450
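For reference, on Calico v3.x the same knob lives in the FelixConfiguration resource; a hedged equivalent (the ipipMTU field name is assumed from the Felix configuration reference):

# Hedged v3.x equivalent of the calicoctl v1 command above:
calicoctl patch felixconfiguration default --patch '{"spec":{"ipipMTU": 1450}}'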

unicell commented 6 years ago

@squeed I guess that's why the default Calico manifest uses 1440 for the IPIP MTU.

https://github.com/projectcalico/calico/blob/v3.0.3/v3.0/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml#L230-L232

I'm taking the v3.0 calico.yaml spec as an example. I wish there were documentation somewhere explaining why those settings were chosen; otherwise people may be hitting the same issue as yours.
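A hedged way to check which MTU a running cluster actually ended up with (resource and device names assumed from the standard self-hosted manifests):

# The manifest's MTU setting typically lives in the calico-config ConfigMap; the tunnel
# device shows what Felix actually applied on a node:
kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu
ip link show tunl0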

squeed commented 6 years ago

It doesn't matter what the MTU is: whatever value the pods use ends up stored in the other host's exception cache for that host's IP.

As an experiment, I set the pod's MTU to be 1460, while the MTU of the tunl0 was 1480. Because of the masquerading, the route cache used the lower value:

core@master1 ~ $ ip route get 10.1.1.50
10.1.1.50 dev ens3 src 10.1.1.10 uid 500 
    cache expires 323sec mtu 1460 

Both IPs are on normal 1500-byte interfaces; the MTU cache "should" show 1500.

detiber commented 6 years ago

If Linux is not sending the ICMP messages needed for pmtu discovery, then is it a matter of ensuring the ip_no_pmtu_disc and/or ip_forward_use_pmtu sysctls are set properly?
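For reference, these are the two sysctls in question and their usual defaults (shown only for inspection, not as a recommended fix):

# net.ipv4.ip_no_pmtu_disc: 0 means PMTU discovery is enabled
# net.ipv4.ip_forward_use_pmtu: 0 means forwarding ignores locally cached PMTU values
sysctl net.ipv4.ip_no_pmtu_disc
sysctl net.ipv4.ip_forward_use_pmtu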

squeed commented 6 years ago

The problem is more subtle: the kernel is managing PMTU correctly. The problem is that the same IP address (due to the masquerade) has a variable MTU, which compounds with its use as a tunnel endpoint.

I haven't tried disabling PMTU entirely. That might work, but it would almost certainly cause more problems :-)

tmjd commented 6 years ago

@squeed What kernel version were you using when you were testing this? I've been trying to reproduce what you were seeing and have not been able to yet. I attempted to check the MTU cache values like you showed and was unable to. After some googling to figure out why I could not get any MTU cache output, I looked at the ip-route man page, which says "Starting with Linux kernel version 3.6, there is no routing cache for IPv4 anymore." (Hence my question about kernel version.)

squeed commented 6 years ago

@tmjd it was a recent kernel version, since I was running CoreOS stable. I don't have it off-hand. I'll spin up another cluster and try and repro.

So, recent Linux kernels don't have a route-cache, that's true (they just have an efficient prefix-tree). However, they do maintain something called the "exception cache," where they store things like MTU overrides. So we're still hitting that path.

tmjd commented 6 years ago

Is there anything special you did to get cache output from ip route get...? I've tried both coreos (1576.4.0) and Ubuntu 16.04 and both produce output like the following when using the commands you were suggesting.

core@k8s-node-02 ~ $ ip route get 172.18.18.102
172.18.18.102 dev eth1 src 172.18.18.103 uid 500 
    cache 

I've also tried using netstat -eCr and get no cache information. (I've also tried the commands using sudo in case it was a permissions issue.)

What is your testing environment? So far I've tried in GCE using Ubuntu and a local Vagrant setup with Coreos.

squeed commented 6 years ago

ip route get <dest> will only show an mtu if there is an exception for that individual destination.
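To make that concrete, a small sketch (the destination address is just an example):

ip route get 10.1.1.50           # prints "cache expires ... mtu ..." only if an exception exists
ip route flush cache             # drops learned exceptions, including stale MTU entries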

My testing environment is the CoreOS tectonic installer running on a few virtualbox machines. Nothing particularly special.

whereisaaron commented 6 years ago

I came across this post when solving a recent AWS+CoreOS+k8s issue. It sounded like a different, Calico-specific issue, but now that @squeed mentions CoreOS, it could be related to my issue, which I documented and resolved over on the most excellent kube-aws project. Although I focus on the VPC-level issues there, I also noticed it causes Calico and Flannel to have mismatched configurations.

https://github.com/kubernetes-incubator/kube-aws/issues/1349

CoreOS 1745.3.1 and 1745.4.0 include a networkd bug that causes problems for clusters with mixed instance types (e.g. T2 and M3/4/5). This is fixed in 1745.5.0 (stable).

All the 'current' AWS instance types support jumbo frames (MTU = 9001). This is set via DHCP; however, networkd in these CoreOS versions fails to apply it, leaving the instances with their default MTU. While T2 instances support MTU=9001, they appear to default to MTU=1500. This leaves you with different nodes in the cluster having different MTUs.

Clients of TCP load balancers will get PMTU errors, believing the PMTU is 8951 or 1500 when it is actually 1450. You'll tend to get MTU-related hangs or disconnections when connections head to T2 worker nodes, due to the incorrect MTU.

If you have T2 nodes for your control plane and you upgrade to these versions (1745.3.1 and 1745.4.0), you'll likely see all your workers go 'NotReady' and appear to stop reporting state to the controllers via the API load balancer. In reality the controller MTU has suddenly gone from 9001 to 1500, and it takes a while for the load balancer and worker nodes to work this out. In my experience the workers recover in about 10 minutes.
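A quick, hedged way to spot this kind of node-to-node MTU mismatch (the interface name is assumed to be eth0):

ip -o link show eth0 | awk '{print $5}'     # prints 9001 on jumbo-frame nodes, 1500 otherwise
tracepath <peer-node-ip>                    # reports the PMTU discovered along the path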

dimm0 commented 6 years ago

In my cluster I'm trying to figure out how to set a different MTU for different nodes with Calico in the CNI config. Is there a way to do that at all?
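A hedged sketch: the Calico CNI plugin accepts an "mtu" key in its CNI network config, which is a per-node file, so templating a different value onto each node is one possible approach (the path and filename are assumptions based on typical kubeadm/Kubespray installs):

grep -n '"mtu"' /etc/cni/net.d/10-calico.conflist    # per-node value handed to new pod veths
# Existing pods keep the MTU they were created with; only new pods pick up a changed value.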

squeed commented 6 years ago

@whereisaaron @dimm0 this issue isn't about the MTU of the underlying interface (though that is an interesting problem). This is specifically about the design of Calico causing inconsistent MTU caching and unreachability within the overlay network. I do want to make sure this particular issue doesn't become a dumping ground for all kinds of MTU weirdness.

dimm0 commented 6 years ago

Some other people think that's the same issue I'm having (https://github.com/projectcalico/calico/issues/2026), but yeah, I agree.

saumoh commented 6 years ago

@squeed Could you try to recreate this issue with the latest CoreOS stable? I tried with the following version but could not reproduce the scenario where host A sets an MTU of 1460 for host B in the route "cache":

$ cat /etc/lsb-release 
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1745.6.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1745.6.0 (Rhyolite)"
core@k8s-master ~ $ uname -r
4.14.48-coreos-r1

hekonsek commented 6 years ago

I've just run into the same issue. It seems that starting the Kubernetes NGINX Ingress Controller in host-network mode causes the same problems. In my case, lowering the tunl0 MTU from 1440 to 1300 did the job and solved the problem.
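For anyone who wants to test this by hand before changing the Calico configuration, a throwaway sketch (Felix may reset the device MTU when it restarts, so the persistent fix is the Calico MTU setting mentioned earlier in the thread):

ip link set tunl0 mtu 1300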

hekonsek commented 6 years ago

In case somebody wants to reproduce the bug: I deployed my Kubernetes cluster on Scaleway's Fedora 28 with the latest Kubespray, then deployed the ingress controller using the Helm chart (https://github.com/kubernetes/charts/tree/master/stable/nginx-ingress) with the controller.hostNetwork option set to true.

Then you can deploy any pod exposing a REST endpoint that generates output larger than the MTU. If you curl the pod's endpoint you will see the client wait forever for a response. Sniffing the network traffic confirms that the client receives only part of the response and then waits for the rest.
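A hedged sketch of that repro (Helm v2 syntax, chart path taken from the link above; the release name and URL are placeholders):

helm install stable/nginx-ingress --name ingress --set controller.hostNetwork=true
# Then request any backend that returns more than roughly an MTU's worth of data;
# the transfer stalls after the first MTU-sized chunk.
curl -v http://<ingress-node-ip>/<large-response-path>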

anjuls commented 6 years ago

@hekonsek I am also facing this intermittent problem in a 12-node prod cluster. CoreDNS is working, but the ingress controller and dashboard can't talk to the Kubernetes service. I didn't face this issue in a small cluster of 4 nodes. I will try changing the MTU and see if it works.

hekonsek commented 6 years ago

@anjuls In my case it was a 3-node cluster.

anjuls commented 6 years ago

@hekonsek I managed to fix my cluster.

ieugen commented 6 years ago

@hekonsek: I'm having the same issues with a similar setup: a 1+3 node cluster on top of a WireGuard VPN using the Calico CNI. The k8s version is 1.11, installed with kubeadm. All nodes run Debian Stretch.

I've managed to reproduce it by making a packet capture; in Wireshark I "followed" the TCP stream and saw the size of the data. In my case it is 1868 bytes. Any response (request?) of 1868 bytes or more causes a gateway timeout on ingress-nginx.
To reproduce it in my case, I saved the Wireshark data and used curl:

curl -X POST --data @1867-bytes-of-data-work.log https://test-svc.example.com   # this works
curl -X POST --data @1868-bytes-of-data-fail.log https://test-svc.example.com   # this fails

I'm kind of a noob in the networking area, so my question is: how can I determine the proper MTU value in my case, when traffic is also tunneled via the WireGuard VPN? I've found [1], which talks about similar issues.

A second point I would like to raise is that this issue should be mentioned in the Calico installation documentation.

This is how my interfaces look; I have different MTU values for the WireGuard, Calico, and tunnel devices:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
3: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
5: cali9cafa0a893e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
6: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 192.168.3.1/32 brd 192.168.3.1 scope global tunl0
7: caliad17b2e6582@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
9: califcc50f7010f@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 

[1] https://github.com/StreisandEffect/streisand/issues/1089
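For the MTU question above, a hedged back-of-the-envelope calculation, assuming an IPv4 underlay and WireGuard's default device MTU:

# eth0  : 1500
# wg0   : 1420   (WireGuard's default, leaving room for its encapsulation overhead)
# tunl0 : 1420 - 20 = 1400   (IPIP adds a 20-byte outer IPv4 header)
# So the tunnel and pod interfaces should be no larger than 1400 in this stack, e.g.:
ip link set tunl0 mtu 1400       # and set the same value in the Calico MTU configuration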

boranx commented 6 years ago

We are facing the same issues. Any update about this?

squeed commented 4 years ago

I believe this series of kernel changes will fix this: https://www.mail-archive.com/netdev@vger.kernel.org/msg345225.html

ocherfas commented 4 years ago

In my case, running ip route flush cache on the machine that drops the packets solved it temporarily. After a day the problem seems to come back.

Davidrjx commented 3 years ago

Any update on this issue? I ran into a similar problem on VMs, where the calico-node pod is not ready; it looks like:

...
 Warning  Unhealthy  20s (x382 over 63m)  kubelet, k-node-master  (combined from similar events): Readiness probe failed: Threshold time for bird readiness check:  30s
calico/node is not ready: BIRD is not ready: BGP not established with 10.246.*.12,10.246.*.13
2020-12-16 02:15:33.430 [INFO][5396] readiness.go 88: Number of node(s) with BGP peering established = 0
...
enginious-dev commented 3 years ago

@Davidrjx try executing sudo ufw allow 179/tcp comment "Calico networking (BGP)"

Davidrjx commented 3 years ago

@Davidrjx try executing sudo ufw allow 179/tcp comment "Calico networking (BGP)"

thanks and sorry for late reply.

gaopeiliang commented 3 years ago

I set the tunl0 and veth MTU to 1480 and the host device MTU to 1500, with /proc/sys/net/ipv4/ip_no_pmtu_disc = 0. One day the network path changed and a "fragmentation needed, MTU 1330" ICMP error came back.

The route exception for the remote host was updated:

10.200.40.21 via 10.200.114.1 dev bond0.114 src 10.200.114.198 cache expires 597sec mtu 1330

but the tunl0 IPIP route was not updated:

172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128 cache expires 455sec mtu 1480

so containers kept sending packets sized for an MTU of 1480, and the big packets were dropped.

I changed the tunl0 pmtudisc attribute with ip tunnel change tunl0 mode ipip pmtudisc,

and then the tunl0 route was updated:

172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128 cache expires 455sec mtu 1310

Why does Calico not set pmtudisc when it sets up the IPIP device?