Closed by squeed 6 months ago
I solved this by lowering Calico's IPIP MTU:
calicoctl config set --raw=felix IpInIpMtu 1450
@squeed I guess that's why the default Calico manifest uses 1440 for the IPIP MTU.
I'm taking the v3.0 calico.yaml spec as an example. I wish there were a document somewhere explaining why those settings were chosen; otherwise people might be hitting the same issue as yours.
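For context on where numbers like 1440 and 1450 come from, here is a minimal sketch of the IPIP overhead arithmetic; the values are generic assumptions for a standard Ethernet path, not statements about Calico's actual defaults:

```shell
# Sketch: IPIP overhead arithmetic (assumed values, not Calico defaults)
ETH_MTU=1500
IPIP_OVERHEAD=20                      # ip-in-ip adds one extra 20-byte IPv4 header
TUNNEL_MTU=$((ETH_MTU - IPIP_OVERHEAD))
echo "max tunnel MTU: $TUNNEL_MTU"    # 1480; 1440 or 1450 leave extra headroom
```

Defaults lower than 1480 presumably trade a little throughput for safety margin against additional encapsulation on the path.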
It doesn't matter what the MTU is, because whatever value the pods have will be stored in the host's cache for that destination host.
As an experiment, I set the pod's MTU to be 1460, while the MTU of the tunl0 was 1480. Because of the masquerading, the route cache used the lower value:
core@master1 ~ $ ip route get 10.1.1.50
10.1.1.50 dev ens3 src 10.1.1.10 uid 500
cache expires 323sec mtu 1460
Both IPs are on normal 1500-byte interfaces. The cached MTU "should" show 1500.
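The cached value can also be pulled out programmatically, e.g. for monitoring. This sketch parses the captured sample shown above; on a live host you would pipe `ip route get 10.1.1.50` directly into the same pipeline (addresses here are from the example, not real infrastructure):

```shell
# Sketch: extract the cached PMTU exception from `ip route get`-style output.
# Parses a captured sample; live usage: ip route get <dst> | grep -o 'mtu [0-9]*'
sample='10.1.1.50 dev ens3 src 10.1.1.10 uid 500
    cache expires 323sec mtu 1460'
cached_mtu=$(printf '%s\n' "$sample" | grep -o 'mtu [0-9]*' | awk '{print $2}')
echo "cached PMTU: $cached_mtu"   # 1460, not the expected 1500
```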
If Linux is not sending the ICMP messages needed for pmtu discovery, then is it a matter of ensuring the ip_no_pmtu_disc and/or ip_forward_use_pmtu sysctls are set properly?
The problem is more subtle; it is managing PMTU correctly. The problem is that the same IP address (due to the masquerade) has a variable MTU. This compounds with its use as a tunnel endpoint.
I haven't tried disabling PMTU entirely. That might work, but it almost certainly causes more problems :-)
@squeed What kernel version were you using when you tested this? I've been trying to reproduce what you were seeing and have not been able to yet. I attempted to check the MTU cache values like you showed and was unable to. After some googling to figure out why I could not get any MTU cache output, I looked at the ip-route man page, which says: "Starting with Linux kernel version 3.6, there is no routing cache for IPv4 anymore."
(Hence my question about kernel version.)
@tmjd it was a recent kernel version, since I was running CoreOS stable. I don't have it off-hand. I'll spin up another cluster and try and repro.
So, recent Linux kernels don't have a route-cache, that's true (they just have an efficient prefix-tree). However, they do maintain something called the "exception cache," where they store things like MTU overrides. So we're still hitting that path.
Is there anything special you did to get cache output from ip route get? I've tried both CoreOS (1576.4.0) and Ubuntu 16.04, and both produce output like the following when using the commands you suggested.
core@k8s-node-02 ~ $ ip route get 172.18.18.102
172.18.18.102 dev eth1 src 172.18.18.103 uid 500
cache
I've also tried using netstat -eCr and get no cache information. (I've also tried the commands with sudo in case it was a permissions issue.)
What is your testing environment? So far I've tried GCE using Ubuntu and a local Vagrant setup with CoreOS.
ip route get <dest> will only show an MTU if there is an exception for that individual destination.
My testing environment is the CoreOS tectonic installer running on a few virtualbox machines. Nothing particularly special.
I came across this post while solving a recent AWS+CoreOS+k8s issue. It sounded like a different, Calico-specific issue, but now that @squeed mentions CoreOS, this could be related to my issue, which I documented and resolved over on the most excellent kube-aws project. Although I focus on the VPC-level issues, I also noticed it causes Calico and Flannel to have mismatched configurations.
https://github.com/kubernetes-incubator/kube-aws/issues/1349
CoreOS 1745.3.1 and 1745.4.0 include a networkd bug that causes problems for clusters with mixed instance types (e.g. T2 and M3/4/5). This is fixed in 1745.5.0 (stable).
All the 'current' AWS instance types support jumbo frames (MTU = 9001). This is set via DHCP; however, networkd in these CoreOS versions fails to apply it, leaving the instances at their default MTU. While T2 instances support MTU 9001, they appear to default to MTU 1500. This leaves you with different nodes in the cluster having different MTUs. Clients of TCP load balancers will get PMTU errors where they think the PMTU is 8951 or 1500 when it is actually 1450. You'll tend to get hangs or disconnections if a connection lands on a T2 worker node, due to the incorrect MTU.
If you have T2 nodes for your control plane and you upgrade to these versions (1745.3.1 and 1745.4.0), you'll likely see all your workers go 'NotReady' and appear to stop reporting state to controllers via the API load balancer. In reality the controller MTU has suddenly gone from 9001 to 1500, and it takes a while for the load balancer and worker nodes to work this out. In my experience the workers recover in about 10 minutes.
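The mismatch described above can be sketched numerically; the 50-byte overlay overhead is an assumption (a VXLAN-style figure, consistent with the 8951/1450 numbers mentioned), not a measured value:

```shell
# Sketch: jumbo-frame node vs default-MTU T2 node (assumed overhead values)
JUMBO_MTU=9001
DEFAULT_MTU=1500
OVERLAY_OVERHEAD=50    # e.g. a VXLAN-style overlay; exact value varies by setup
echo "jumbo node overlay MTU:   $((JUMBO_MTU - OVERLAY_OVERHEAD))"    # 8951
echo "default node overlay MTU: $((DEFAULT_MTU - OVERLAY_OVERHEAD))"  # 1450
```

A client that learns 8951 from one node and then hits a 1450 path will black-hole large packets until PMTU discovery catches up.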
In my cluster I'm trying to figure out how to set different MTUs for different nodes with Calico in the CNI config. Is there a way to do that at all?
@whereisaaron @dimm0 this issue isn't about the MTU of the underlying interface (though that is an interesting problem). It is specifically about the design of Calico causing inconsistent MTU caching and unreachability within the overlay network. I want to make sure this particular issue doesn't become a dumping ground for all kinds of MTU weirdness.
Some other people think that's the same issue I'm having (https://github.com/projectcalico/calico/issues/2026), but yeah, I agree.
@squeed Could you try to recreate this issue with the latest CoreOS stable? I tried with the following version but could not reproduce the scenario where hostA sets an MTU of 1460 for hostB in the route "cache":
$ cat /etc/lsb-release
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1745.6.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1745.6.0 (Rhyolite)"
core@k8s-master ~ $ uname -r
4.14.48-coreos-r1
I've just run into the same issue. It seems that starting the Kubernetes Nginx Ingress Controller in network=host mode causes the same problems. In my case, lowering the tunl0 MTU from 1440 to 1300 did the job and solved the problem.
In case somebody wants to reproduce the bug: I deployed my Kubernetes cluster on Scaleway's Fedora 28 with the latest Kubespray, then deployed the ingress controller using the Helm chart (https://github.com/kubernetes/charts/tree/master/stable/nginx-ingress) with the controller.hostNetwork option set to true.
Then you can just deploy any pod exposing REST endpoint and generate output larger than MTU. If you try to curl the pod endpoint you will see client waiting forever for a response. Sniffing network traffic confirms that client receives only part of the response and then waits for the rest.
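A minimal sketch of that repro, assuming a hypothetical echo endpoint (the URL and the 1500-byte size are placeholders; pick a size just above your suspected path MTU):

```shell
# Sketch: build a payload larger than the suspected path MTU and POST it.
# The endpoint URL is a placeholder for your own pod/service.
head -c 1500 /dev/zero | tr '\0' 'A' > payload.dat
wc -c < payload.dat    # 1500
# curl --max-time 10 -X POST --data-binary @payload.dat http://my-pod-endpoint/echo
# If the response exceeds the broken PMTU, curl hangs until --max-time expires.
```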
@hekonsek I am also facing this intermittent problem in a 12-node prod cluster. CoreDNS is working, but the ingress controller and dashboard can't talk to the Kubernetes svc. I didn't face this issue in a small cluster of 4 nodes. I will try changing the MTU and see if it works.
@anjuls In my case it was 3-nodes cluster.
@hekonsek I managed to fix my cluster.
@hekonsek: I'm having the same issues with a similar setup: a 1+3 node cluster on top of a WireGuard VPN using the Calico CNI. The k8s version is 1.11, installed with kubeadm. All nodes run Debian Stretch.
I've managed to reproduce it by making a packet capture; in Wireshark I "followed" the TCP stream and saw the size of the data. In my case it is 1868.
Any response (or request?) of 1868 bytes or more causes a gateway timeout on ingress-nginx.
To reproduce it in my case, I saved the Wireshark data and used curl:
curl -X POST --data @1867-bytes-of-data-work.log https://test-svc.example.com -- this works
curl -X POST --data @1868-bytes-of-data-fail.log https://test-svc.example.com -- this fails
I'm kind of a noob in the networking area, so my question is: how can I determine the proper MTU value in my case, when I also tunnel traffic via a WireGuard VPN? I've found [1], which talks about similar issues.
A second point I would like to raise is that this issue should be mentioned in the Calico installation documentation.
This is how my interfaces look; I have different MTU values for WireGuard, Calico, and the tunnel:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
3: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
5: cali9cafa0a893e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
6: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
inet 192.168.3.1/32 brd 192.168.3.1 scope global tunl0
7: caliad17b2e6582@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
9: califcc50f7010f@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
[1] https://github.com/StreisandEffect/streisand/issues/1089
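One way to reason about a safe MTU here is to stack the encapsulation overheads. This is a sketch under stated assumptions: the 80-byte WireGuard headroom is inferred from the wg0 MTU of 1420 shown in the interface list (1500 - 80), and IPIP adds one 20-byte IPv4 header:

```shell
# Sketch: stack encapsulation overheads for Calico IPIP over WireGuard
ETH_MTU=1500
WG_HEADROOM=80     # matches the wg0 MTU of 1420 shown above (1500 - 80)
IPIP_OVERHEAD=20   # one extra IPv4 header
SAFE_TUNL_MTU=$((ETH_MTU - WG_HEADROOM - IPIP_OVERHEAD))
echo "safe tunl0 MTU over wg0: $SAFE_TUNL_MTU"   # 1400, below the 1440 configured
```

If that arithmetic holds, the tunl0 MTU of 1440 in the listing above is 40 bytes too large for traffic that also traverses wg0.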
We are facing the same issues. Any update about this?
I believe this series of kernel changes will fix this: https://www.mail-archive.com/netdev@vger.kernel.org/msg345225.html
In my case, running ip route flush cache on the machine that drops the packets solved it temporarily, but after a day the problem comes back.
Any update on this issue? I ran into a similar problem on VMs, with the running calico-node pod not ready; it looks like:
...
Warning Unhealthy 20s (x382 over 63m) kubelet, k-node-master (combined from similar events): Readiness probe failed: Threshold time for bird readiness check: 30s
calico/node is not ready: BIRD is not ready: BGP not established with 10.246.*.12,10.246.*.13
2020-12-16 02:15:33.430 [INFO][5396] readiness.go 88: Number of node(s) with BGP peering established = 0
...
@Davidrjx try execute
sudo ufw allow 179/tcp comment "Calico networking (BGP)"
Thanks, and sorry for the late reply.
I set the tunl0 and veth MTUs to 1480, the host device MTU to 1500, and /proc/sys/net/ipv4/ip_no_pmtu_disc = 0. One day the network path changed and a "fragmentation needed, MTU 1330" ICMP error arrived, so the route cache was updated:
10.200.40.21 via 10.200.114.1 dev bond0.114 src 10.200.114.198 cache expires 597sec mtu 1330
But the tunl0 IPIP route was not updated:
172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128 cache expires 455sec mtu 1480
So containers kept sending packets sized for an MTU of 1480, and the big packets were dropped.
I changed the tunl0 pmtudisc attribute with ip tunnel change tunl0 mode ipip pmtudisc, and then the tunl0 IPIP route was updated:
172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128 cache expires 455sec mtu 1310
Why doesn't Calico set pmtudisc when it sets up IPIP devices?
The problem: When using Calico on Kubernetes with some host-networking pods, the Linux MTU cache results in unreachability.
This is due to IP masquerading when accessing destinations outside the ClusterCIDR, along with services running in host networking.
Setup Details:
Consider a Kubernetes cluster running Calico. The Calico daemon runs on every node and configures the Calico device (tunl0) with an IP within that node's PodCIDR. PodCIDRs are chosen from the ClusterCIDR of 10.2.0.0/16. Because Calico uses ip-in-ip encapsulation, all of the pods (and the tunl0 interface) have an MTU of 1480.
The problem:
In other words, packets over 1460 bytes in size will be silently dropped for all pods between A and B.
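The 1460-byte limit in the summary follows directly from the IPIP overhead; a minimal sketch of that arithmetic, using the figures from the setup above:

```shell
# Sketch: with a pod/tunl0 MTU of 1480, the ip-in-ip header eats 20 bytes,
# so the masqueraded path learns a PMTU that caps pod payloads at 1460.
POD_MTU=1480
IPIP_OVERHEAD=20
MAX_SAFE_PAYLOAD=$((POD_MTU - IPIP_OVERHEAD))
echo "payloads above $MAX_SAFE_PAYLOAD bytes are silently dropped"   # 1460
```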