Open r3pek opened 3 years ago
Original conversation https://calicousers.slack.com/archives/CUKP5S64R/p1614646015032900
r3pek 3 hours ago: ok.. the issue is when accessing the node that doesn't have the pod running. Accessing directly the node that has the pod running yields "normal" speeds.
r3pek 1 hour ago: might be eBPF related after all.... just retested with good-old-iptables and now I'm getting normal (120-150mbps) results with it. Switching back to eBPF sends the speeds back to 30-60kbps.
The issue is that we drop some packets because they are too large to fit into the VXLAN tunnel, and we reply back with an ICMP "MTU too big" message. That is correct so far :heavy_check_mark:
However, in this scenario the node has 2 NICs: `eth0`, used for external connectivity, and `ens10`, used for connectivity among the k8s nodes. When we receive the large packet on `eth0` we reply back, but with the "k8s node's IP", i.e. the address of `ens10`, which is private, and thus the reply never makes it to the sender. Therefore the sender never updates its path MTU and keeps hitting the same issue every now and then, which makes TCP think that the link is congested, and thus the bandwidth sucks :boom:
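For reference, here is a minimal sketch (using the standard Linux uapi definitions, not Calico's BPF code) of what that "MTU too big" reply is: an ICMPv4 Destination Unreachable (type 3) / Fragmentation Needed (code 4) message carrying the next-hop MTU, which is the value the sender would use to lower its path MTU:

```c
/* Minimal sketch (not Calico's BPF code): the ICMPv4 "fragmentation needed /
 * MTU too big" header that drives path-MTU discovery on the sender.
 * The mtu value (e.g. 1400) tells the sender how small its packets must be. */
#include <linux/icmp.h>
#include <arpa/inet.h>

static void fill_frag_needed(struct icmphdr *icmp, unsigned short tunnel_mtu)
{
    icmp->type = ICMP_DEST_UNREACH;        /* type 3 */
    icmp->code = ICMP_FRAG_NEEDED;         /* code 4 */
    icmp->un.frag.mtu = htons(tunnel_mtu); /* next-hop MTU the sender should adopt */
    icmp->checksum = 0;                    /* checksum is computed later, over header + quoted packet */
}
```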
Excerpt from BPF logs on node2 for frame #10 in the tcpdump.
<idle>-0 [001] ..s. 155578.474680: 0: eth0-----I: New packet at ifindex=2; mark=0
<idle>-0 [001] .Ns. 155578.474692: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
<idle>-0 [001] .Ns. 155578.474694: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
<idle>-0 [001] .Ns. 155578.474695: 0: eth0-----I: TCP; ports: s=56360 d=2412
<idle>-0 [001] .Ns. 155578.474696: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
<idle>-0 [001] .Ns. 155578.474697: 0: eth0-----I: CT-6 lookup from b2a6310c:56360
<idle>-0 [001] .Ns. 155578.474698: 0: eth0-----I: CT-6 lookup to bc22be20:2412
<idle>-0 [001] .Ns. 155578.474701: 0: eth0-----I: CT-6 Hit! NAT FWD entry, doing secondary lookup.
<idle>-0 [001] .Ns. 155578.474703: 0: eth0-----I: CT-6 fwd tun_ip:c0a80a04
<idle>-0 [001] .Ns. 155578.474704: 0: eth0-----I: CT-6 result: 5
<idle>-0 [001] .Ns. 155578.474705: 0: eth0-----I: conntrack entry flags 0x4
<idle>-0 [001] .Ns. 155578.474706: 0: eth0-----I: CT Hit
<idle>-0 [001] .Ns. 155578.474707: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
<idle>-0 [001] .Ns. 155578.474708: 0: eth0-----I: Entering calico_tc_skb_accepted
<idle>-0 [001] .Ns. 155578.474709: 0: eth0-----I: src=b2a6310c dst=bc22be20
<idle>-0 [001] .Ns. 155578.474710: 0: eth0-----I: post_nat=0:0
<idle>-0 [001] .Ns. 155578.474711: 0: eth0-----I: tun_ip=0
<idle>-0 [001] .Ns. 155578.474712: 0: eth0-----I: pol_rc=0
<idle>-0 [001] .Ns. 155578.474712: 0: eth0-----I: sport=56360
<idle>-0 [001] .Ns. 155578.474713: 0: eth0-----I: flags=0
<idle>-0 [001] .Ns. 155578.474714: 0: eth0-----I: ct_rc=5
<idle>-0 [001] .Ns. 155578.474714: 0: eth0-----I: ct_related=0
<idle>-0 [001] .Ns. 155578.474715: 0: eth0-----I: ip->ttl 53
<idle>-0 [001] .Ns. 155578.474717: 0: eth0-----I: CT: DNAT to a70cb76:22
<idle>-0 [001] .Ns. 155578.474718: 0: eth0-----I: CT says encap to node c0a80a04
<idle>-0 [001] .Ns. 155578.474719: 0: eth0-----I: SKB too long (len=1578) vs limit=1400
<idle>-0 [001] .Ns. 155578.474720: 0: eth0-----I: Request packet with DNF set is too big
<idle>-0 [001] .Ns. 155578.474721: 0: eth0-----I: Entering calico_tc_skb_send_icmp_replies
<idle>-0 [001] .Ns. 155578.474722: 0: eth0-----I: ICMP type 3 and code 4
<idle>-0 [001] .Ns. 155578.474723: 0: eth0-----I: IP id=2882 s=b2a6310c d=bc22be20
<idle>-0 [001] .Ns. 155578.474724: 0: eth0-----I: ip->ihl: 5
<idle>-0 [001] .Ns. 155578.474725: 0: eth0-----I: Trimming to 118
<idle>-0 [001] .Ns. 155578.474741: 0: eth0-----I: Inserting 28
<idle>-0 [001] .Ns. 155578.474743: 0: eth0-----I: Len after insert 146
<idle>-0 [001] .Ns. 155578.474744: 0: eth0-----I: IP id=0 s=0 d=0
<idle>-0 [001] .Ns. 155578.474746: 0: eth0-----I: ICMP v4 reply creation succeeded
<idle>-0 [001] .Ns. 155578.474747: 0: eth0-----I: IP id=0 s=c0a80a03 d=b2a6310c
<idle>-0 [001] .Ns. 155578.474748: 0: eth0-----I: Traffic is towards host namespace, marking with c3400000.
<idle>-0 [001] .Ns. 155578.474750: 0: eth0-----I: Final result=ALLOW (0). Program execution time: 58209ns
<idle>-0 [001] .Ns. 155578.474794: 0: eth0-----E: New packet at ifindex=2; mark=c3400000
<idle>-0 [001] .Ns. 155578.474796: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
<idle>-0 [001] .Ns. 155578.474797: 0: eth0-----E: ICMP; type=3 code=4
<idle>-0 [001] .Ns. 155578.474799: 0: eth0-----E: CT-1 lookup from c0a80a03:0
<idle>-0 [001] .Ns. 155578.474800: 0: eth0-----E: CT-1 lookup to b2a6310c:0
<idle>-0 [001] .Ns. 155578.474802: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
<idle>-0 [001] .Ns. 155578.474803: 0: eth0-----E: CT-ICMP: proto 6
<idle>-0 [001] .Ns. 155578.474804: 0: eth0-----E: CT-1 related lookup from b2a6310c:56360
<idle>-0 [001] .Ns. 155578.474805: 0: eth0-----E: CT-1 related lookup to bc22be20:2412
<idle>-0 [001] .Ns. 155578.474806: 0: eth0-----E: CT-1 Hit! NAT FWD entry, doing secondary lookup.
<idle>-0 [001] .Ns. 155578.474824: 0: eth0-----E: CT-1 fwd tun_ip:c0a80a04
<idle>-0 [001] .Ns. 155578.474828: 0: eth0-----E: CT-1 result: 2
<idle>-0 [001] .Ns. 155578.474829: 0: eth0-----E: CT-1 result: related
<idle>-0 [001] .Ns. 155578.474831: 0: eth0-----E: conntrack entry flags 0x4
<idle>-0 [001] .Ns. 155578.474832: 0: eth0-----E: CT Hit
<idle>-0 [001] .Ns. 155578.474834: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
<idle>-0 [001] .Ns. 155578.474834: 0: eth0-----E: Entering calico_tc_skb_accepted
<idle>-0 [001] .Ns. 155578.474836: 0: eth0-----E: src=c0a80a03 dst=b2a6310c
<idle>-0 [001] .Ns. 155578.474837: 0: eth0-----E: post_nat=0:0
<idle>-0 [001] .Ns. 155578.474838: 0: eth0-----E: tun_ip=0
<idle>-0 [001] .Ns. 155578.474839: 0: eth0-----E: pol_rc=0
<idle>-0 [001] .Ns. 155578.474840: 0: eth0-----E: sport=0
<idle>-0 [001] .Ns. 155578.474842: 0: eth0-----E: flags=0
<idle>-0 [001] .Ns. 155578.474843: 0: eth0-----E: ct_rc=2
<idle>-0 [001] .Ns. 155578.474844: 0: eth0-----E: ct_related=1
<idle>-0 [001] .Ns. 155578.474845: 0: eth0-----E: ip->ttl 63
<idle>-0 [001] .Ns. 155578.474847: 0: eth0-----E: IP id=0 s=c0a80a03 d=b2a6310c
<idle>-0 [001] .Ns. 155578.474850: 0: eth0-----E: Final result=ALLOW (0). Program execution time: 53407ns
This is the problem: https://github.com/projectcalico/felix/blob/master/bpf-gpl/icmp.h#L112-L113
IIRC the meaning of `HOST_IP` changed to "the IP of the node in the cluster", and thus it is not correct in a multi-NIC scenario.
Can the new dataplane (VPP) help in this situation?
I'm running a 3-node cluster (1 master and 2 worker nodes) with all of them connected via a private network `192.168.10.0/24` with the MTU set to 1450 (they are VPSes running on Hetzner). All 3 nodes also have a public-facing interface with a public IP address; all communication between the nodes is done via the private LAN.
The current test I'm doing is deploying an SFTP server on one worker node, uploading a file to it (connecting via the public interface), and checking the upload speed when the connection is made to each of the worker nodes, repeating the test in eBPF and non-eBPF modes.
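For reference, the 1400-byte limit seen in the BPF log above is consistent with this setup, assuming the standard 50-byte VXLAN-over-IPv4 overhead (a sketch, not something stated in the issue itself):

```c
/* Sketch of the MTU arithmetic, assuming standard VXLAN-over-IPv4 overhead:
 * the private NIC MTU of 1450 minus the encapsulation headers gives the
 * 1400-byte limit reported as "SKB too long (len=1578) vs limit=1400". */
#define PRIVATE_NIC_MTU     1450  /* MTU of the Hetzner private network (ens10) */
#define VXLAN_OVERHEAD        50  /* outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14 */
#define VXLAN_PACKET_LIMIT  (PRIVATE_NIC_MTU - VXLAN_OVERHEAD)  /* = 1400 */
```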
Expected Behavior
I expected to see some performance penalty when using eBPF but not as much as observed.
Current Behavior
When connecting to the node that doesn't have the pod running, I see this:
eBPF speeds:
Downloads/android-studio-ide-183.5692245-linux.tar.gz 0% 2496KB 67.9KB/s 4:20:01 ETA
non-eBPF speeds:
Downloads/android-studio-ide-183.5692245-linux.tar.gz 6% 66MB 16.3MB/s 00:59 ETA
When connecting to the same node where the pod is running, speeds are almost the same in both modes (around 13-18 MB/s).
Possible Solution
Don't really know :(
Steps to Reproduce (for bugs)
I think that just setting up the same environment should do it
Context
I was truly expecting some performance penalty, but not of this order of magnitude (16 MB/s -> 60 KB/s). The main reason for using eBPF was to preserve the source IP of connections from external sources, so that I can have services on the cluster that actually do something useful with that information (anti-spam, for example). Since most of my services don't scale and only run one instance, I set up DNS pointing to both worker nodes of the cluster and just let the cluster handle where the pod actually is, avoiding the DNS voodoo that would otherwise be needed to always resolve to the correct (public) IP of the node running the pod. This might not even be a bug, but a consequence of how eBPF works; in that case, I'll just have to drop the requirement of having the source IP available.
Here's my felixconfiguration:
Nodes info:
To make the change I just patch the felixconfiguration, set `bpfEnabled` to true or false, and enable or disable kube-proxy accordingly. After the changes I restart all calico-node pods.
Your Environment