projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Add support for Path MTU discovery #5010

Open sathuish opened 3 years ago

sathuish commented 3 years ago

We have an AWS multi-region setup. We are trying to install the application across the regions, and the deployment is failing due to MTU auto-detection being enabled in the CNI.

Expected Behavior

Data should be transferred and received properly.

Current Behavior

We are using Calico version 3.18. We have set the MTU value to 0 so that Calico auto-detects the MTU. With an AWS on-prem multi-region deployment, we face issues with pod-to-pod communication. When we reduced the MTU value to 1350, communication worked properly without any issues.
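As context for why a manually lowered value like 1350 helps, here is a rough sketch of the MTU arithmetic involved. The overhead numbers follow the Calico MTU documentation (IP-in-IP: 20 bytes, VXLAN: 50 bytes, WireGuard: 60 bytes); the 9001 and 1400 figures are illustrative assumptions, not measurements from this deployment:

```python
# Sketch (not Calico code) of the MTU arithmetic at play: the pod interface
# MTU must fit inside the smallest MTU on the path, minus any encapsulation
# overhead Calico adds. Overhead values follow the Calico MTU docs
# (IP-in-IP: 20 bytes, VXLAN: 50 bytes, WireGuard: 60 bytes).
ENCAP_OVERHEAD = {"none": 0, "ipip": 20, "vxlan": 50, "wireguard": 60}

def pod_mtu(path_mtu: int, encap: str = "none") -> int:
    """Largest pod-interface MTU that still fits on the given path."""
    return path_mtu - ENCAP_OVERHEAD[encap]

# Auto-detection only sees the local NIC (e.g. 9001-byte jumbo frames on
# AWS), not the narrower inter-region path, so it derives an MTU that is
# far too large for cross-region traffic:
print(pod_mtu(9001, "ipip"))  # 8981 - what auto-detection would derive
print(pod_mtu(1400, "ipip"))  # 1380 - what a ~1400-byte cross-region path allows
```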

Possible Solution

Add Path MTU discovery to Calico

Steps to Reproduce (for bugs)

  1. Enable MTU auto-detection in Calico
  2. Deploy across multiple AWS regions
  3. Transfer/receive larger packets

Context

Add Path MTU discovery to Calico

Your Environment

Calico version: 3.18.3
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.20.7
Operating System and version: CentOS 7.9.2009

caseydavenport commented 3 years ago

Yep, as you discovered, Calico can only detect MTU based on the local node's configuration. This was by design, and of course it has some limitations. However, as you said, manual MTU configuration exists for exactly these situations.

Path MTU discovery might solve this, but it is an undertaking we've so far tried to avoid due to the extra complexity it involves. For now, I'm leaving this open as an enhancement, but I suggest continuing to use manually configured MTU values.
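For anyone landing here, one way to pin the MTU manually is via the operator's `Installation` resource (field name per the Calico MTU docs; adjust to your install method and the value your path actually supports):

```yaml
# Hypothetical example: pinning the MTU instead of auto-detecting it.
# The 1350 value is the one reported to work in this thread.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 1350
```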

We are trying to install the application across the regions

I'd also strongly recommend against running a single cluster across multiple regions, and instead use availability zones. A single Kubernetes cluster / Calico cluster across multiple regions is bound to cause you some pain, due to added latency and instability caused by running the control plane across the public internet.

If you need redundancy, I'd recommend a separate cluster-per-region, with nodes spread across AZs within the region.

defo89 commented 9 months ago

@caseydavenport I wanted to follow up on this existing issue and raise the awareness about problems that happen when advertising services via BGP/ECMP.

In our environment we stumbled upon this and had to implement a fix. Cluster nodes are contained within a single region. Communication within a region always uses a jumbo MTU (inter-node and customer-to-externalIP). Communication between regions goes through upstream routers which have different connectivity options (MPLS as the main path, but also a backup VPN).

Even having MPLS in the path has caused issues due to the 4 bytes it requires for headers. The backup VPN may go through the internet, where the MTU could be 1300 bytes.

The problem is described in https://blog.cloudflare.com/path-mtu-discovery-in-practice/ and Cloudflare's implementation is available at https://github.com/cloudflare/pmtud

In our implementation (the readme may not be up to date) we:

  1. push ICMP type 3 code 4 (frag-needed) packets into a specific nflog group
  2. take the payload of the frag-needed packet and re-send it to all nodes within the same cluster (for now, over a separate L2 connection)
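The frag-needed messages captured in step 1 carry the constraining next-hop MTU in their header, as defined by RFC 1191. A minimal illustrative parser (a sketch, not code from the implementation above):

```python
import struct
from typing import Optional

# Parse an ICMPv4 "Destination Unreachable / Fragmentation Needed" message
# (type 3, code 4) and return its Next-Hop MTU field per RFC 1191.
# Illustrative sketch only; real tooling like cloudflare/pmtud does far more.
def parse_frag_needed(icmp: bytes) -> Optional[int]:
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp[:8]
    )
    if icmp_type != 3 or code != 4:
        return None  # not a frag-needed message
    return next_hop_mtu

# Synthetic frag-needed header advertising an MTU of 1300 (the backup-VPN
# case described above); checksum left as 0 for brevity.
pkt = struct.pack("!BBHHH", 3, 4, 0, 0, 1300)
print(parse_frag_needed(pkt))  # → 1300
```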

I am wondering whether that's something you would still consider to be in scope for Calico?

caseydavenport commented 9 months ago

@matthewdupre might be the right one to comment on this.

My first inclination is that this would be best handled as a separate solution, with Calico exposing the necessary surfaces to enable implementing PMTU without actually writing the code into Calico itself. However, I am happy to be convinced otherwise - I am not an expert on PMTU.

ehsan310 commented 5 months ago

I am wondering whether that's something you would still consider to be in scope for Calico?

This is a very interesting discussion, as we are facing the same issue when using Calico in eBPF mode and advertising our service IP. We hit the problem of the MTU changing across the different networks a packet crosses to reach our service IP, and since we can't respond to ICMP, packets get dropped.

tomastigera commented 5 months ago

when we can't respond to ICMP so packet get dropped.

@ehsan310 why can't you respond to icmp?

tomastigera commented 1 month ago

Any update on this one?

ehsan310 commented 1 month ago

I have upgraded the cluster but did not get a chance to change the MTU back to the default 1500 to see if the issue is fixed. We have a production workload, so I have to see what I can do to test it @tomastigera