projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Pod in nat-outgoing should not be SNATed when it accesses local cluster hosts #8960

Open wayne-cheng opened 1 week ago

wayne-cheng commented 1 week ago

When I enable the natOutgoing setting, I notice that the pod's traffic is also SNATed when it accesses local cluster hosts. I think this is unnecessary, as it causes some performance degradation.

Below are tcpdump captures of ping packets from the pod (177.65.1.1) to a cluster host (192.168.1.83), which confirm this behavior.

Pod (177.65.1.1) -> Host (192.168.1.83)

On the pod (177.65.1.1):

$ tcpdump -vnn -i eth0 icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:27:48.148126 IP (tos 0x0, ttl 64, id 19509, offset 0, flags [DF], proto ICMP (1), length 84)
    177.65.1.1 > 192.168.1.83: ICMP echo request, id 840, seq 0, length 64
09:27:48.148452 IP (tos 0x0, ttl 63, id 49483, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.83 > 177.65.1.1: ICMP echo reply, id 840, seq 0, length 64
09:27:49.149174 IP (tos 0x0, ttl 64, id 19588, offset 0, flags [DF], proto ICMP (1), length 84)
    177.65.1.1 > 192.168.1.83: ICMP echo request, id 840, seq 1, length 64

On the host (192.168.1.84) where the pod is deployed:

You can observe that the traffic is SNATed to the host IP 192.168.1.84 as it leaves the host.

$ tcpdump -vnn -i enp1s0 icmp
tcpdump: listening on enp1s0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:27:48.148208 IP (tos 0x0, ttl 63, id 19509, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.84 > 192.168.1.83: ICMP echo request, id 64813, seq 0, length 64
09:27:48.148420 IP (tos 0x0, ttl 64, id 49483, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.83 > 192.168.1.84: ICMP echo reply, id 64813, seq 0, length 64
09:27:49.149229 IP (tos 0x0, ttl 63, id 19588, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.84 > 192.168.1.83: ICMP echo request, id 64813, seq 1, length 64

So I manually modified the iptables rule generated by Calico to add a match that excludes the cluster host addresses:

$ iptables -t nat -nvL cali-nat-outgoing  --line-numbers
Chain cali-nat-outgoing (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1       52  3901 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully

$ iptables -t nat -R cali-nat-outgoing 1  -m comment --comment "cali:flqWnvo8yq4ULQLa" -m set --match-set cali40masq-ipam-pools src -m set ! --match-set cali40all-ipam-pools dst -m set ! --match-set cali40all-hosts-net dst -j MASQUERADE --random-fully

$ iptables -t nat -nvL cali-nat-outgoing  --line-numbers
Chain cali-nat-outgoing (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst ! match-set cali40all-hosts-net dst random-fully

After making this change, the traffic from the Pod to the cluster hosts is no longer SNATed:

$ tcpdump -vnn -i enp1s0 icmp
tcpdump: listening on enp1s0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:11:18.546251 IP (tos 0x0, ttl 63, id 21893, offset 0, flags [DF], proto ICMP (1), length 84)
    177.65.1.1 > 192.168.1.83: ICMP echo request, id 836, seq 0, length 64
08:11:18.546529 IP (tos 0x0, ttl 64, id 59989, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.83 > 177.65.1.1: ICMP echo reply, id 836, seq 0, length 64
08:11:19.547256 IP (tos 0x0, ttl 63, id 21992, offset 0, flags [DF], proto ICMP (1), length 84)
    177.65.1.1 > 192.168.1.83: ICMP echo request, id 836, seq 1, length 64

In the Calico source code, this is a simple change. I hope you can review my upcoming PR.

Recently, I have also been testing Calico for Windows and encountered the same issue.

However, it is more severe there, as it prevents cluster hosts from connecting to Windows containers. Below are Wireshark captures of ping packets on the Windows server (192.168.1.74).

Linux host (192.168.1.83) -> Windows container (177.65.1.175):

You can observe that the response packets are also SNATed, which should not happen. On Linux, conntrack prevents this problem; this might be a peculiarity of Windows, though the specific details are unclear since Windows is not open source.

[Wireshark screenshot]

Windows container (177.65.1.175) -> Linux host (192.168.1.83):

This behavior is expected, but the SNAT is unnecessary, just like the Linux case I described earlier.

[Wireshark screenshot]

I modified the CNI configuration at C:\Program Files\containerd\cni\conf to add the local cluster host subnet (192.168.1.0/24) to the ExceptionList, and it resolved the problem (though it only removes the NAT):

{
  "policies": [
    {
      "Name": "EndpointPolicy",
      "Value": {
        "Type": "OutBoundNAT",
        "ExceptionList": [
          "10.96.0.0/16",
          "192.168.1.0/24"
        ]
      }
    }
  ]
}

The Calico documentation mentions the natOutgoing setting for Windows, but I found that it does not match the current behavior. Additionally, I noticed that windows_disable_host_subnet_nat_exclusion has been removed from the code, and I am unsure why this change was made.

I think we can add back the logic to exclude cluster hosts; a sketch of the idea follows. If you agree, I'd like to do some testing and then submit another PR.
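To illustrate what I mean, here is a minimal Go sketch (not Calico's actual Windows CNI code; the type and helper names are hypothetical) of how the plugin could append the host subnet to the OutBoundNAT ExceptionList when the exclusion is enabled. It only models the Value part of the policy shown above:

package main

import (
	"encoding/json"
	"fmt"
)

// outBoundNATPolicy mirrors the HNS OutBoundNAT policy value from the config
// above. Illustrative only, not Calico's actual types.
type outBoundNATPolicy struct {
	Type          string   `json:"Type"`
	ExceptionList []string `json:"ExceptionList"`
}

// buildOutBoundNAT returns the policy value, optionally excluding the cluster
// host subnet(s) from outgoing NAT (the behaviour that
// windows_disable_host_subnet_nat_exclusion used to control).
func buildOutBoundNAT(serviceCIDR string, hostSubnets []string, excludeHosts bool) outBoundNATPolicy {
	exceptions := []string{serviceCIDR}
	if excludeHosts {
		exceptions = append(exceptions, hostSubnets...)
	}
	return outBoundNATPolicy{Type: "OutBoundNAT", ExceptionList: exceptions}
}

func main() {
	p := buildOutBoundNAT("10.96.0.0/16", []string{"192.168.1.0/24"}, true)
	b, _ := json.MarshalIndent(p, "", "  ")
	fmt.Println(string(b)) // prints the Value object with the ExceptionList shown above
}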


coutinhop commented 5 days ago

@wayne-cheng, thanks for the thorough analysis!

We actually considered doing something like this early on in Calico, but we found that it only works if your network is permissive to unknown source IPs. If your network implements reverse path filtering (RPF), then the pod-to-host traffic will be dropped. In any case you get asymmetric routing, where the return traffic goes over the tunnel; this can cause other problems, for example different MTUs for ingress and egress traffic. It was a decision to trade off a potential performance hit in cases where skipping SNAT would work against breaking multiple use cases where it wouldn't.

We could definitely look into making this a configurable setting, but shouldn't unconditionally disable SNAT for cluster hosts/nodes.

wayne-cheng commented 3 days ago

@coutinhop OK, I have now defined a DisableHostSubnetNATExclusion field in FelixConfiguration. PTAL at my PR #8961.

// When set to true and ip pool setting `natOutgoing` is true, packets sent from Calico networked containers in this pool
// to cluster host subnet will not be excluded from being masqueraded.  [Default: false]
DisableHostSubnetNATExclusion bool `json:"disableHostSubnetNATExclusion,omitempty"`

This way, when Felix generates the SNAT iptables rules on a Linux node, it will use this field to decide whether to exclude the cluster host subnet, roughly as sketched below.
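As a rough illustration of the intended effect only (not Felix's actual rule-rendering code, which goes through its own iptables abstractions; the helper name is hypothetical, the ipset names are the ones from the output above):

package main

import (
	"fmt"
	"strings"
)

// natOutgoingRule builds the match portion of the cali-nat-outgoing MASQUERADE
// rule, gated on the proposed felix config field.
func natOutgoingRule(disableHostSubnetNATExclusion bool) string {
	matches := []string{
		"-m set --match-set cali40masq-ipam-pools src",
		"-m set ! --match-set cali40all-ipam-pools dst",
	}
	if !disableHostSubnetNATExclusion {
		// Skip SNAT for traffic destined to cluster hosts.
		matches = append(matches, "-m set ! --match-set cali40all-hosts-net dst")
	}
	return strings.Join(matches, " ") + " -j MASQUERADE --random-fully"
}

func main() {
	fmt.Println(natOutgoingRule(false)) // host subnet excluded from SNAT
	fmt.Println(natOutgoingRule(true))  // current behaviour: SNAT to hosts as well
}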

However, for Windows, I found that the NAT rules are applied when the CNI plugin is called to add the network, rather than being implemented by Felix itself. If we make this a configurable setting, it will only take effect for pods created after the change, as stated in the Calico documentation.

The previously removed code (@song-jiang) placed this windows_disable_host_subnet_nat_exclusion logic in the CNI configuration file, requiring it to be set individually on each host, which may not be an ideal solution. I think this config field could instead live in the global FelixConfiguration named default, and the Calico for Windows CNI implementation could fetch it via CalicoClient, as sketched below. However, this approach would ignore settings from host configuration files or environment variables.
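Roughly, the Windows side could read the field like this (a sketch assuming the libcalico-go clientv3 API and the DisableHostSubnetNATExclusion field from PR #8961; error handling kept minimal, and the real CNI plugin would reuse its existing client):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/projectcalico/calico/libcalico-go/lib/clientv3"
	"github.com/projectcalico/calico/libcalico-go/lib/options"
)

// disableHostSubnetNATExclusion reads the proposed field from the global
// FelixConfiguration named "default".
func disableHostSubnetNATExclusion(ctx context.Context) (bool, error) {
	c, err := clientv3.NewFromEnv()
	if err != nil {
		return false, err
	}
	fc, err := c.FelixConfigurations().Get(ctx, "default", options.GetOptions{})
	if err != nil {
		return false, err
	}
	// Field added by the PR referenced above; not in released Calico.
	return fc.Spec.DisableHostSubnetNATExclusion, nil
}

func main() {
	disabled, err := disableHostSubnetNATExclusion(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("disableHostSubnetNATExclusion:", disabled)
}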

If you think this is feasible, I will proceed with implementing it for Windows right away.