projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Turn on the tunl0 device pmtudisc attribute to avoid a bug in some kernel versions in IPIP mode #7146

Open gaopeiliang opened 1 year ago

gaopeiliang commented 1 year ago

Current Behavior

Currently, Felix initializes the tunl0 device at startup like this:

bash# ip tunl show tunl0

tunl0: any/ip  remote any  local any  ttl inherit  nopmtudisc

The tunl0 device has the nopmtudisc attribute by default.

With nopmtudisc, tunl0 cannot adapt to MTU changes learned through PMTU discovery, and by default the kernel sets

/proc/sys/net/ipv4/ip_no_pmtu_disc = 0 

This means IP packets always have the DF bit set. When a "Need Fragment" ICMP message lowers the host1 -> host2 MTU below the MTU of tunl0 and calixxx,

tunl0 drops IP packets that exceed the smaller MTU.

Each dropped packet also produces a Need Fragment ICMP from the host to itself, so host1 can never recover the MTU.
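To check whether a node is in the state described above, the flag can be read back from the `ip tunnel show` output. The following is a minimal sketch, assuming the output format shown above, where `nopmtudisc` is printed as a separate token when PMTU discovery is disabled on the tunnel. The function name `pmtudisc_enabled` is hypothetical and not part of Felix.

```python
def pmtudisc_enabled(tunnel_show_line: str) -> bool:
    """Return True when the tunnel line does not carry the nopmtudisc flag.

    Assumption: `ip tunnel show` prints the token "nopmtudisc" when PMTU
    discovery is disabled on the tunnel, as in the output quoted in this issue.
    """
    # Compare whole whitespace-separated tokens so that "pmtudisc" inside
    # "nopmtudisc" cannot produce a false match.
    return "nopmtudisc" not in tunnel_show_line.split()


# Line copied from the `ip tunl show tunl0` output in this issue.
line = "tunl0: any/ip  remote any  local any  ttl inherit  nopmtudisc"
print(pmtudisc_enabled(line))  # False
```

On a live node the input line would come from running `ip tunnel show tunl0`; the sketch deliberately takes the text as an argument so it can be checked without root access.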

Possible Solution

  1. Change the tunl0 attribute from nopmtudisc to pmtudisc, like this:
    ip tunnel change tunl0 mode ipip pmtudisc

This makes the tunl0 device update its MTU to the correct value, and traffic works again.

# show route cache info

# host1 and  host2

10.201.xx.21 via 10.200.xx.1 dev bond0.114  src 10.200.xx.196 
    cache  expires 594sec mtu 1100

# in container

bash# ip route get 192.168.169.73
192.168.169.73 via 169.254.1.1 dev eth0  src 192.168.221.15 
    cache  expires 577sec mtu 1080

# on tunl0 device 

bash# ip route get 192.168.169.73 
192.168.169.73 via 10.201.40.21 dev tunl0  src 192.168.221.0 
    cache  expires 555sec mtu 1080
  2. All of the PMTU handling happens in the kernel. To see what happens, read the source code:
// transmitting an IP packet may update the MTU
// net/ipv4/ip_tunnel.c
--ip_tunnel_xmit
    --tnl_update_pmtu
static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
                struct rtable *rt, __be16 df,
                const struct iphdr *inner_iph)
{
    struct ip_tunnel *tunnel = netdev_priv(dev);
    int pkt_size;
    int mtu;

    pkt_size = skb->len - tunnel->hlen;
    pkt_size -= dev->type == ARPHRD_ETHER ? dev->hard_header_len : 0;

    // If df is set, compute the MTU from the cached route MTU
    if (df) {
        mtu = dst_mtu(&rt->dst) - (sizeof(struct iphdr) + tunnel->hlen);
        mtu -= dev->type == ARPHRD_ETHER ? dev->hard_header_len : 0;
    } else {
        mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
    }

    skb_dst_update_pmtu_no_confirm(skb, mtu);

    // Send a Need Frag ICMP back to the upstream hop
    if (skb->protocol == htons(ETH_P_IP)) {
        if (!skb_is_gso(skb) &&
            (inner_iph->frag_off & htons(IP_DF)) &&
            mtu < pkt_size) {
            memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
            return -E2BIG;
        }
    }
    .......
}

The df argument is set from the link attribute "pmtudisc", so this tunl0 attribute affects this case.
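The arithmetic in `tnl_update_pmtu` can be checked against the route cache values quoted earlier (1100 on the host route, 1080 on tunl0). The sketch below is a simplified Python model of the quoted C function, not kernel code. It assumes an IPv4 header of 20 bytes without options and `tunnel->hlen` of 0 for plain IPIP, and it omits the `ARPHRD_ETHER` adjustment. The function names and the device MTU of 1480 are hypothetical.

```python
IPV4_HEADER_LEN = 20  # sizeof(struct iphdr) for a header without options


def tunnel_path_mtu(df, route_mtu, inner_dst_mtu, tunnel_hlen=0):
    """Model of the MTU selection in the quoted tnl_update_pmtu().

    df            -- the df argument passed in by ip_tunnel_xmit
    route_mtu     -- dst_mtu(&rt->dst), the cached MTU of the outer route
    inner_dst_mtu -- stands in for dst_mtu(skb_dst(skb)) or dev->mtu
    tunnel_hlen   -- tunnel->hlen; assumed to be 0 for plain IPIP
    """
    if df:
        return route_mtu - (IPV4_HEADER_LEN + tunnel_hlen)
    return inner_dst_mtu


def sends_frag_needed(pkt_size, mtu, inner_df, is_gso=False):
    """Model of the IPv4 branch that sends ICMP_FRAG_NEEDED and returns -E2BIG."""
    return bool((not is_gso) and inner_df and mtu < pkt_size)


# Host route cache from this issue: mtu 1100. With df set the tunnel sees 1080.
print(tunnel_path_mtu(True, 1100, 1480))   # 1080
# With df clear the cached route MTU is ignored (1480 is a hypothetical dev->mtu).
print(tunnel_path_mtu(False, 1100, 1480))  # 1480
```

In this model, a 1400-byte inner packet with DF set triggers the ICMP path only when the computed MTU is smaller than the packet, which is the condition `mtu < pkt_size` in the kernel code.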

I also found that different kernel versions have different implementations:

// before Linux 4.19.168

    if (tnl_update_pmtu(dev, skb, rt, tnl_params->frag_off, inner_iph)) {
        ip_rt_put(rt);
        goto tx_error;
    }
    ......

    df = tnl_params->frag_off;
    if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df)
        df |= (inner_iph->frag_off&htons(IP_DF));

// after Linux 4.19.168

    df = tnl_params->frag_off;
    if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df)
        df |= (inner_iph->frag_off & htons(IP_DF));

    if (tnl_update_pmtu(dev, skb, rt, df, inner_iph)) {
        ip_rt_put(rt);
        goto tx_error;
    }

So before Linux 4.19.161, the DF flag used for the MTU update only reflects the ip link attribute; afterwards it is also inherited from the inner IP packet. This is a kernel bug in MTU handling.

Fix commit: https://github.com/torvalds/linux/commit/50c661670f6a3908c273503dfa206dfc7aa54c07. The commit message describes the same case as this one.
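The difference between the two kernel snippets comes down to which value of `df` reaches `tnl_update_pmtu`. The following is a small sketch of that decision for IPv4 inner packets only, based on the two code excerpts quoted in this issue. The function name and the `patched` parameter are hypothetical labels for the "before" and "after" code paths.

```python
def df_passed_to_pmtu_update(link_pmtudisc, inner_df, ignore_df=False, patched=True):
    """Return the df value that ip_tunnel_xmit hands to tnl_update_pmtu.

    link_pmtudisc -- True when the tunnel was configured with pmtudisc
                     (tnl_params->frag_off carries IP_DF)
    inner_df      -- True when the inner IPv4 header has DF set
    ignore_df     -- models tunnel->ignore_df
    patched       -- False for the "before" snippet, True for the "after" one
    """
    df = link_pmtudisc
    if patched and not ignore_df:
        # After the fix, the inner DF bit is merged in before the PMTU update.
        df = df or inner_df
    return df


# tunl0 with nopmtudisc, inner packet with DF set:
print(df_passed_to_pmtu_update(False, True, patched=False))  # False
print(df_passed_to_pmtu_update(False, True, patched=True))   # True
```

With `pmtudisc` set on the link, both code paths pass a true `df`, which is why changing the tunl0 attribute sidesteps the ordering difference.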


So Felix could work around this kernel bug by setting the pmtudisc attribute when it initializes tunl0 on kernels below Linux 4.19.161.
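A version-gated workaround along these lines could look like the sketch below. It is illustrative only and written in Python rather than Felix's Go. The helper names are hypothetical, and the cutoff version is passed in as a parameter because this thread cites more than one version number. The plain tuple comparison is also an assumption: it does not account for other stable branches that may have received the fix at different patch levels.

```python
import re


def parse_kernel_release(release):
    """Parse a release string such as '4.19.90-generic' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def tunl0_workaround_command(release, first_fixed):
    """Return the ip command that enables pmtudisc on tunl0, or None if not needed.

    first_fixed is the first kernel version assumed to contain the fix,
    given as a (major, minor, patch) tuple supplied by the caller.
    """
    if parse_kernel_release(release) < first_fixed:
        # Same command as proposed in this issue.
        return ["ip", "tunnel", "change", "tunl0", "mode", "ipip", "pmtudisc"]
    return None


# Example with the version number from the sentence above as the cutoff.
print(tunl0_workaround_command("4.19.90-generic", (4, 19, 161)))
```

On a real node the release string could be taken from `platform.release()`, and the returned list could be passed to `subprocess.run`, which requires root privileges.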

These are only suggestions. Are there other ideas about this attribute?

Steps to Reproduce

[image]

  1. Lower the MTU from host 10.200.xx.196 to host 10.201.xx.21 with an ICMP Need Fragment message (image label 3). The route cache will then look like this:

    10.201.xx.21 via 10.200.xx.1 dev bond0.114  src 10.200.xx.196 
    cache  expires 594sec mtu 1100
  2. Now when container 192.168.169.73, hosted on 10.201.xx.21, fetches data from 192.168.221.15, hosted on 10.200.xx.196, the transfer blocks because of the MTU problem.

[image]

Context

TCP connections hit this MTU problem, which caused random errors in our application for a long time.

With random ICMP messages changing the MTU on random host links, it is very difficult to debug.

Environment

BenjaminHuang commented 1 year ago

Setting pmtudisc on tunl0 also forces DF on tunnel egress packets. This could be a drawback and should be tested widely under different circumstances. In my view, applying the kernel fix or adopting TCP MTU probing (tcp_mtu_probing) could be better alternatives.

song-jiang commented 1 year ago

Can you manually set MTU to the correct value? For instance,

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  ipv6Support: false
  ipipMTU: 1400
gaopeiliang commented 1 year ago

Can you manually set MTU to the correct value? For instance,

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  ipv6Support: false
  ipipMTU: 1400

Manually setting a safely small MTU value works, but then the kernel PMTU mechanism never comes into play.

We now optimize across multiple links with SD-WAN, and some links send 'Need Frag' messages that change the host MTU outside our control.

We also do not want to set the smallest possible MTU, and it is difficult to change host infrastructure globally.

mazdakn commented 1 year ago

@gaopeiliang can you test the issue with a newer version of Calico, like 3.25? (3.26 will be released soon.) I am not saying that newer versions have the fix (maybe they do; there have been many changes, including fixes, since 3.13), but using a newer version helps make the discussion more efficient.

gaopeiliang commented 1 year ago

@gaopeiliang can you test the issue with a newer version of Calico, like 3.25? (3.26 will be released soon.) I am not saying that newer versions have the fix (maybe they do; there have been many changes, including fixes, since 3.13), but using a newer version helps make the discussion more efficient.

This is a PMTU handling bug in specific kernel versions. It is not related to any specific Calico version, and the code that initializes the tunl0 device attributes, d.dataplane.RunCmd("ip", "tunnel", "add", "tunl0", "mode", "ipip"), has never changed.

The latest Calico version we tested is 3.21.