projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Incorrect source IP for ICMP MTU too big replies in multi-NIC scenario #4439

Open r3pek opened 3 years ago

r3pek commented 3 years ago

I'm running a 3-node cluster (1 master and 2 worker nodes), all connected via a private network 192.168.10.0/24 with the MTU set to 1450 (they are VPSes running on Hetzner). All 3 nodes also have a public-facing interface with a public IP address; all inter-node communication goes over the private LAN.

The test I'm currently doing is deploying an SFTP server on one worker node, uploading a file to it (connecting via the public interface), and checking the upload speed when connecting to each of the worker nodes, repeating the test in eBPF and non-eBPF modes.

Expected Behavior

I expected to see some performance penalty when using eBPF but not as much as observed.

Current Behavior

When connecting to the node that doesn't have the pod running, I see this:

eBPF speeds:
Downloads/android-studio-ide-183.5692245-linux.tar.gz   0% 2496KB  67.9KB/s 4:20:01 ETA

non-eBPF speeds:
Downloads/android-studio-ide-183.5692245-linux.tar.gz   6%   66MB  16.3MB/s   00:59 ETA

When connecting to the same node where the pod is running, speeds are almost the same in both modes (around 13-18 MB/s).

Possible Solution

Don't really know :(

Steps to Reproduce (for bugs)

I think that just setting up the same environment should do it

Context

I was truly expecting some performance penalty, but not of this order of magnitude (16 MB/s -> 60 KB/s). The main reason for using eBPF was to preserve the source IP of external connections, so that services on the cluster can actually do something useful with that information (anti-spam, for example). Since most of my services don't scale and only run one instance, I set up DNS pointing to both nodes of the cluster and let the cluster route to wherever the pod actually is, avoiding the DNS voodoo that would otherwise be needed to always resolve to the correct (public) IP of the node running the pod. This might not even be a bug but a consequence of how eBPF works; in that case I'll just have to drop the requirement of having the source IP available.

Here's my FelixConfiguration:

# ./calicoctl get felixconfiguration default -o yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: "2021-02-16T00:46:06Z"
  name: default
  resourceVersion: "2665639"
  uid: a395f6ab-d0d5-4ba7-a54b-a797504970e7
spec:
  bpfEnabled: false
  bpfExternalServiceMode: Tunnel
  bpfLogLevel: ""
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

Nodes info:

./calicoctl get nodes -o wide
NAME       ASN       IPV4              IPV6   
master     (64512)   192.168.10.2/32          
worker01   (64512)   192.168.10.4/32          
worker02   (64512)   192.168.10.3/32 

To switch modes, I just patch the FelixConfiguration to set bpfEnabled to true or false and enable or disable kube-proxy. After the changes, I restart all calico-node pods.

Your Environment

tomastigera commented 3 years ago

Original conversation https://calicousers.slack.com/archives/CUKP5S64R/p1614646015032900

r3pek 3 hours ago: ok.. the issue is when accessing the node that doesn't have the pod running. Accessing the node that has the pod running directly yields "normal" speeds.

r3pek 1 hour ago: might be eBPF-related after all... just retested with good old iptables and now I'm getting normal (120-150 Mbps) results with it. Switching back to eBPF sends the speeds back to 30-60 Kbps.

tomastigera commented 3 years ago

The issue is that we drop some packets because they are too large to fit in the VXLAN tunnel, and we reply with an ICMP "MTU too big" message. That is correct so far :heavy_check_mark:

However, in this scenario, the node has 2 NICs: eth0 is used for external connectivity and ens10 for connectivity among the k8s nodes. When we receive the large packet on eth0 we reply back, but with the "k8s node IP", i.e. the IP of ens10, which is private, and thus the reply never makes it to the sender. Therefore the sender never updates its path MTU and keeps hitting the same issue, which makes TCP think that the link is congested, and thus the bandwidth sucks :boom:
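The failure mode can be illustrated with a short sketch (the public address below is hypothetical; 192.168.10.x matches the reported inter-node network): an ICMP "fragmentation needed" reply sourced from an RFC 1918 address is not routable back to an Internet sender, so the client's path MTU is never lowered.

```python
import ipaddress

# RFC 1918 private ranges; replies sourced from these are dropped on the
# public Internet before reaching an external client.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(addr):
    """True if addr falls in a private (non-Internet-routable) range."""
    a = ipaddress.ip_address(addr)
    return any(a in net for net in RFC1918)

# ens10 (inter-node NIC) address: the reply is unroutable -> PMTU never updates.
assert is_rfc1918("192.168.10.3") is True
# A hypothetical public eth0 address: a reply from here would get back.
assert is_rfc1918("203.0.113.7") is False
```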

Excerpt from BPF logs on node2 for frame #10 in the tcpdump.

          <idle>-0     [001] ..s. 155578.474680: 0: eth0-----I: New packet at ifindex=2; mark=0
          <idle>-0     [001] .Ns. 155578.474692: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
          <idle>-0     [001] .Ns. 155578.474694: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
          <idle>-0     [001] .Ns. 155578.474695: 0: eth0-----I: TCP; ports: s=56360 d=2412
          <idle>-0     [001] .Ns. 155578.474696: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
          <idle>-0     [001] .Ns. 155578.474697: 0: eth0-----I: CT-6 lookup from b2a6310c:56360
          <idle>-0     [001] .Ns. 155578.474698: 0: eth0-----I: CT-6 lookup to   bc22be20:2412
          <idle>-0     [001] .Ns. 155578.474701: 0: eth0-----I: CT-6 Hit! NAT FWD entry, doing secondary lookup.
          <idle>-0     [001] .Ns. 155578.474703: 0: eth0-----I: CT-6 fwd tun_ip:c0a80a04
          <idle>-0     [001] .Ns. 155578.474704: 0: eth0-----I: CT-6 result: 5
          <idle>-0     [001] .Ns. 155578.474705: 0: eth0-----I: conntrack entry flags 0x4
          <idle>-0     [001] .Ns. 155578.474706: 0: eth0-----I: CT Hit
          <idle>-0     [001] .Ns. 155578.474707: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
          <idle>-0     [001] .Ns. 155578.474708: 0: eth0-----I: Entering calico_tc_skb_accepted
          <idle>-0     [001] .Ns. 155578.474709: 0: eth0-----I: src=b2a6310c dst=bc22be20
          <idle>-0     [001] .Ns. 155578.474710: 0: eth0-----I: post_nat=0:0
          <idle>-0     [001] .Ns. 155578.474711: 0: eth0-----I: tun_ip=0
          <idle>-0     [001] .Ns. 155578.474712: 0: eth0-----I: pol_rc=0
          <idle>-0     [001] .Ns. 155578.474712: 0: eth0-----I: sport=56360
          <idle>-0     [001] .Ns. 155578.474713: 0: eth0-----I: flags=0
          <idle>-0     [001] .Ns. 155578.474714: 0: eth0-----I: ct_rc=5
          <idle>-0     [001] .Ns. 155578.474714: 0: eth0-----I: ct_related=0
          <idle>-0     [001] .Ns. 155578.474715: 0: eth0-----I: ip->ttl 53
          <idle>-0     [001] .Ns. 155578.474717: 0: eth0-----I: CT: DNAT to a70cb76:22
          <idle>-0     [001] .Ns. 155578.474718: 0: eth0-----I: CT says encap to node c0a80a04

          <idle>-0     [001] .Ns. 155578.474719: 0: eth0-----I: SKB too long (len=1578) vs limit=1400
          <idle>-0     [001] .Ns. 155578.474720: 0: eth0-----I: Request packet with DNF set is too big
          <idle>-0     [001] .Ns. 155578.474721: 0: eth0-----I: Entering calico_tc_skb_send_icmp_replies
          <idle>-0     [001] .Ns. 155578.474722: 0: eth0-----I: ICMP type 3 and code 4
          <idle>-0     [001] .Ns. 155578.474723: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20

          <idle>-0     [001] .Ns. 155578.474724: 0: eth0-----I: ip->ihl: 5
          <idle>-0     [001] .Ns. 155578.474725: 0: eth0-----I: Trimming to 118
          <idle>-0     [001] .Ns. 155578.474741: 0: eth0-----I: Inserting 28
          <idle>-0     [001] .Ns. 155578.474743: 0: eth0-----I: Len after insert 146
          <idle>-0     [001] .Ns. 155578.474744: 0: eth0-----I: IP id=0 s=0 d=0
          <idle>-0     [001] .Ns. 155578.474746: 0: eth0-----I: ICMP v4 reply creation succeeded
          <idle>-0     [001] .Ns. 155578.474747: 0: eth0-----I: IP id=0 s=c0a80a03 d=b2a6310c
          <idle>-0     [001] .Ns. 155578.474748: 0: eth0-----I: Traffic is towards host namespace, marking with c3400000.
          <idle>-0     [001] .Ns. 155578.474750: 0: eth0-----I: Final result=ALLOW (0). Program execution time: 58209ns
          <idle>-0     [001] .Ns. 155578.474794: 0: eth0-----E: New packet at ifindex=2; mark=c3400000
          <idle>-0     [001] .Ns. 155578.474796: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
          <idle>-0     [001] .Ns. 155578.474797: 0: eth0-----E: ICMP; type=3 code=4
          <idle>-0     [001] .Ns. 155578.474799: 0: eth0-----E: CT-1 lookup from c0a80a03:0
          <idle>-0     [001] .Ns. 155578.474800: 0: eth0-----E: CT-1 lookup to   b2a6310c:0
          <idle>-0     [001] .Ns. 155578.474802: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
          <idle>-0     [001] .Ns. 155578.474803: 0: eth0-----E: CT-ICMP: proto 6
          <idle>-0     [001] .Ns. 155578.474804: 0: eth0-----E: CT-1 related lookup from b2a6310c:56360
          <idle>-0     [001] .Ns. 155578.474805: 0: eth0-----E: CT-1 related lookup to   bc22be20:2412
          <idle>-0     [001] .Ns. 155578.474806: 0: eth0-----E: CT-1 Hit! NAT FWD entry, doing secondary lookup.
          <idle>-0     [001] .Ns. 155578.474824: 0: eth0-----E: CT-1 fwd tun_ip:c0a80a04
          <idle>-0     [001] .Ns. 155578.474828: 0: eth0-----E: CT-1 result: 2
          <idle>-0     [001] .Ns. 155578.474829: 0: eth0-----E: CT-1 result: related
          <idle>-0     [001] .Ns. 155578.474831: 0: eth0-----E: conntrack entry flags 0x4
          <idle>-0     [001] .Ns. 155578.474832: 0: eth0-----E: CT Hit
          <idle>-0     [001] .Ns. 155578.474834: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
          <idle>-0     [001] .Ns. 155578.474834: 0: eth0-----E: Entering calico_tc_skb_accepted
          <idle>-0     [001] .Ns. 155578.474836: 0: eth0-----E: src=c0a80a03 dst=b2a6310c
          <idle>-0     [001] .Ns. 155578.474837: 0: eth0-----E: post_nat=0:0
          <idle>-0     [001] .Ns. 155578.474838: 0: eth0-----E: tun_ip=0
          <idle>-0     [001] .Ns. 155578.474839: 0: eth0-----E: pol_rc=0
          <idle>-0     [001] .Ns. 155578.474840: 0: eth0-----E: sport=0
          <idle>-0     [001] .Ns. 155578.474842: 0: eth0-----E: flags=0
          <idle>-0     [001] .Ns. 155578.474843: 0: eth0-----E: ct_rc=2
          <idle>-0     [001] .Ns. 155578.474844: 0: eth0-----E: ct_related=1
          <idle>-0     [001] .Ns. 155578.474845: 0: eth0-----E: ip->ttl 63
          <idle>-0     [001] .Ns. 155578.474847: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
          <idle>-0     [001] .Ns. 155578.474850: 0: eth0-----E: Final result=ALLOW (0). Program execution time: 53407ns

capture_node2-v2.pcap.zip
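The sizes in the log are consistent with how an ICMP Destination Unreachable / Fragmentation Needed message (type 3, code 4, per RFC 792) is built: the offending packet is trimmed, then a new 20-byte IPv4 header plus an 8-byte ICMP header are prepended. A quick check of the numbers:

```python
IPV4_HDR = 20   # new outer IPv4 header prepended to the reply
ICMP_HDR = 8    # ICMP type 3 / code 4 header (RFC 792)

trimmed = 118                 # "Trimming to 118" in the log
inserted = IPV4_HDR + ICMP_HDR
final = trimmed + inserted

assert inserted == 28         # matches "Inserting 28"
assert final == 146           # matches "Len after insert 146"
```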

tomastigera commented 3 years ago

This is the problem https://github.com/projectcalico/felix/blob/master/bpf-gpl/icmp.h#L112-L113

IIRC the meaning of HOST_IP changed to "IP of the node in the cluster", and thus it is not correct in a multi-NIC scenario.
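One way to think about the fix (an illustrative sketch only, not the actual Felix/BPF code; the ifindex-to-address table and the public address are hypothetical): the ICMP reply source should be taken from the interface the oversized packet arrived on, rather than from a single HOST_IP constant.

```python
# Hypothetical per-node interface table; in this report eth0 is public and
# ens10 carries the 192.168.10.0/24 inter-node traffic.
IFACE_ADDRS = {
    2: "203.0.113.7",    # ifindex 2 = eth0 (public), hypothetical address
    3: "192.168.10.3",   # ifindex 3 = ens10 (private inter-node)
}

HOST_IP = "192.168.10.3"  # the single "node IP" as Felix currently sees it

def icmp_reply_src(ingress_ifindex):
    """Pick the reply source from the receiving interface so that replies
    to external senders are routable; fall back to HOST_IP otherwise."""
    return IFACE_ADDRS.get(ingress_ifindex, HOST_IP)

assert icmp_reply_src(2) == "203.0.113.7"   # packet came in on public eth0
assert icmp_reply_src(3) == "192.168.10.3"  # packet came in on private ens10
```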

r3pek commented 3 years ago

Can the new dataplane (VPP) help in this situation?