rancher / rke2

https://docs.rke2.io/
Apache License 2.0

RKE2 Cluster running Calico seemingly losing UDP traffic when transiting through service IP to remotely located pod #1541

Closed aiyengar2 closed 2 years ago

aiyengar2 commented 3 years ago

Environmental Info: RKE2 Version: v1.21.3-rc3+rke2r2

Node(s) CPU architecture, OS, and Version:

Linux arvind-rke2-1 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 server nodes. Also reproducible on 3 etcd, 1 controlplane, and 3 worker nodes

Describe the bug:

Steps To Reproduce:

Expected behavior:

All nodes should resolve the DNS

Actual behavior:

Only one node (the one that rke2-coredns is running on) resolves the DNS

Additional context / logs:

This issue was diagnosed in https://github.com/rancher/rancher/issues/33052 but reproduced independently of Rancher using the above steps.

aiyengar2 commented 3 years ago

As noted in https://github.com/rancher/rancher/issues/33052#issuecomment-893672355, this appears to be a regression: the behavior was broken, then fixed, then broken once more, possibly across different RKE2 versions.

Oats87 commented 3 years ago

I debugged this with Arvind and we found interesting behavior: UDP DNS queries cannot be resolved when transiting via the CoreDNS service IP, i.e. 10.43.0.10. If we addressed the CoreDNS pod directly, we could make our DNS queries with no issue. It did not matter whether the client was in a pod or on the node itself.

The DNS service IP 10.43.0.10 worked when CoreDNS was located on the same node as we were testing on.

This is when using the Calico CNI.

This only occurs on Ubuntu 20.04 in our testing. On my CentOS 7 testing boxes, we did not run into this issue.
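
For anyone retracing this, the comparison boils down to something like the following from a node that is not hosting CoreDNS (a sketch; the way the pod IP is looked up here is just one option):

# Query DNS via the CoreDNS service IP -- this is the path that fails on affected nodes:
dig @10.43.0.10 google.com

# Find the CoreDNS pod IP and query it directly -- this worked from everywhere:
kubectl -n kube-system get pods -o wide | grep coredns
dig @<coredns-pod-ip> google.com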

Oats87 commented 3 years ago

ufw was present on the box, but:

# ufw status
Status: inactive

For good measure, systemctl disable ufw --now && reboot did not help either.

brandond commented 3 years ago

Does it make any difference if you switch the host iptables between legacy/nftables or uninstall the host iptables+nftables so that we use the embedded ones?
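
For reference, on Ubuntu that switch might look like the following (a sketch; update-alternatives is the stock Debian/Ubuntu mechanism, and removing the host tools falls back to the binaries RKE2 bundles):

# Show which backend the host iptables currently points at:
update-alternatives --display iptables

# Switch the host iptables/ip6tables to the legacy backend:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

# Or remove the host tools entirely so the embedded ones are used:
sudo apt-get remove -y iptables nftables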

aiyengar2 commented 3 years ago

Does it make any difference if you switch the host iptables between legacy/nftables or uninstall the host iptables+nftables so that we use the embedded ones?

Not sure about this. cc: @Oats87

However, I was able to test this on a v1.21.2+rke2r1 cluster and verify that this issue still exists in that version, so https://github.com/rancher/rke2/issues/1541#issuecomment-893792213 is not accurate.

aiyengar2 commented 3 years ago

In a v1.21.3-rc3+rke2r2 cluster with two Ubuntu 18.04 nodes (as opposed to 20.04 listed above), I was able to reproduce this same behavior on the nodes.

$ uname -a
Linux arvind-ubuntu-1804-0 4.15.0-144-generic #148-Ubuntu SMP Sat May 8 02:33:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

aiyengar2 commented 3 years ago

Without specifying cni: calico in the RKE2 cluster (v1.21.3-rc3+rke2r2), the dig call worked perfectly fine on all nodes.

Seems like this is definitely related to Calico, as indicated in the ticket title.

manuelbuil commented 3 years ago

Editing comment: things seem to work when the pod is the client. The problem comes when the host tries to access the service. This is also happening on v1.21.3+rke2r1.

manuelbuil commented 3 years ago

When tracking the packet, I see it going through the correct kube-proxy iptables rules:

-A KUBE-SERVICES -d 10.43.0.10/32 -p udp -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53 cluster IP" -m udp --dport 53 -j KUBE-SVC-YFPH5LFNKP7E3G4L

-A KUBE-SVC-YFPH5LFNKP7E3G4L -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53" -j KUBE-SEP-F54GWJZTXPXAPHRS

-A KUBE-SEP-F54GWJZTXPXAPHRS -p udp -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53" -m udp -j DNAT --to-destination 10.42.182.4:53
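
For anyone retracing this, those rules can be pulled straight out of the NAT table with something like (a sketch):

# Dump the kube-proxy NAT rules for the CoreDNS UDP service:
sudo iptables -t nat -S | grep 'rke2-coredns-rke2-coredns:udp-53'

# Or list the service chain and look for the 10.43.0.10 entry:
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.43.0.10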

I can see the packet leaving the node:

18:48:21.980669 IP 10.0.10.14.24169 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.60066 > 10.42.182.4.53: 61063+ [1au] A? google.com. (51)

And I can see the packet reaching the other node (the one where CoreDNS is):

18:49:56.672731 IP 10.0.10.14.58192 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.55933 > 10.42.182.4.53: 5083+ [1au] A? google.com. (51)

Then the packet disappears
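
The captures above correspond roughly to commands like these (a sketch, using the interface names from this thread):

# On the client node: watch the VXLAN-encapsulated query leave on the underlay interface
sudo tcpdump -ni eth0 'udp port 4789'

# On the node hosting CoreDNS: watch for the decapsulated query on the Calico VXLAN device
sudo tcpdump -ni vxlan.calico 'udp port 53'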

manuelbuil commented 3 years ago

Sniffing a packet targeting the service. Node with coredns, interface eth0:

19:04:16.499057 IP 10.0.10.14.49959 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.51417 > 10.42.182.4.53: 29795+ [1au] A? google.com. (51)

Sniffing a packet targeting the pod implementing the service. Node with coredns, interface eth0:

19:04:11.194801 IP 10.0.10.14.34353 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.46410 > 10.42.182.4.53: 14722+ [1au] A? google.com. (51)
19:04:11.195126 IP 10.0.10.10.51238 > 10.0.10.14.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.182.4.53 > 10.42.222.64.46410: 14722* 1/0/1 A 142.250.178.142 (77)

Sniffing a packet targeting the service. Node with coredns, interface vxlan.calico: nothing.

Sniffing a packet targeting the pod implementing the service. Node with coredns, interface vxlan.calico:

19:03:32.327538 IP 10.42.222.64.44687 > 10.42.182.4.53: 19939+ [1au] A? google.com. (51)
19:03:32.328436 IP 10.42.182.4.53 > 10.42.222.64.44687: 19939 1/0/1 A 142.250.178.142 (65)

manuelbuil commented 3 years ago

After looking at different things, I noticed that when bypassing the service and querying the pod directly, we see this:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 21070, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.50078 > 10.0.10.10.4789: [udp sum ok] VXLAN, flags [I] (0x08), vni 4096

But if we access the service via the clusterIP, we see this:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 16380, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.7103 > 10.0.10.10.4789: [bad udp cksum 0xbd67 -> 0x4e7f!] VXLAN, flags [I] (0x08), vni 4096

Note the bad udp cksum.

After investigating a bit, I read that this is a known kernel bug that was fixed in 5.7. Apparently, the kernel driver miscalculates the checksum when VXLAN offloading is enabled and the packet is NATed, which is our case when accessing the service via the ClusterIP. CentOS and RHEL 8 have backported the fix, but Ubuntu has not, which is why we only see it on Ubuntu (note that Ubuntu 20.04 uses kernel 5.4.0). This is the kernel fix: https://github.com/torvalds/linux/commit/ea64d8d6c675c0bb712689b13810301de9d8f77a.
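
A quick way to check whether a given node is exposed to this (a sketch; it only inspects the kernel version, the offload flag, and the outer UDP checksums seen above):

# Kernel version -- affected if older than 5.7 and the fix was not backported:
uname -r

# Is transmit checksum offload enabled on the Calico VXLAN device?
sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic

# Verbose tcpdump verifies checksums, so bad outer UDP checksums show up here:
sudo tcpdump -vv -ni eth0 'udp port 4789' | grep 'bad udp cksum'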

Manual fix: disable VXLAN checksum offloading on the vxlan.calico interface on all nodes: sudo ethtool -K vxlan.calico tx-checksum-ip-generic off. I tested it and it works :).

Calico's recommended fix: Calico includes an env variable that, when passed to the agent, disables the feature that triggers this problem (MASQFullyRandom): https://github.com/projectcalico/calico/issues/3145#issuecomment-815813032. This still needs to be tested.

TO DO:

manuelbuil commented 3 years ago

Disabling the MASQFullyRandom feature does not help. Asking Tigera; perhaps something else must be changed. Note that there is a recent PR to fix this in Calico, but it does not seem to be enabled in our version ==> https://github.com/projectcalico/felix/pull/2811

manuelbuil commented 3 years ago

Same issue in openSUSE 15 SP3:

10.0.10.9.26831 > 10.0.10.7.4789: [bad udp cksum 0x69a5 -> 0xc8d6!] VXLAN, flags [I] (0x08), vni 4096

Fixed after running sudo ethtool -K vxlan.calico tx-checksum-ip-generic off

vadorovsky commented 3 years ago

@manuelbuil Are you sure that the kernel commit you linked is the only one?

It's applied in SLE 15 SP3 / Leap 15.3 already: https://github.com/SUSE/kernel/commit/3dc74efc615a97b22c1193bff9c22f651651d041

and you seem to have issues on SLE/openSUSE anyway.

manuelbuil commented 3 years ago

@manuelbuil Are you sure that the kernel commit you linked is the only one?

It's applied in SLE 15 SP3 / Leap 15.3 already: SUSE/kernel@3dc74ef

and you seem to have issues on SLE/openSUSE anyway.

I got the link from https://github.com/projectcalico/calico/issues/3145#issuecomment-742845013.

I reported some issues in openSUSE, but they were related to a dirty environment. Once I deployed freshly, I was able to reproduce the same problem as in Ubuntu.

manuelbuil commented 3 years ago

We ended up using the same solution the Tigera folks are applying, which basically disables the checksum offload if the kernel is older than 5.7.
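
In shell terms the logic amounts to something like this (hypothetical sketch; the real check lives in Felix's feature detection, not in a script):

# Disable VXLAN tx checksum offload only on kernels older than 5.7:
kver=$(uname -r | cut -d. -f1,2)
if [ "$(printf '%s\n' "$kver" 5.7 | sort -V | head -n1)" != "5.7" ]; then
  sudo ethtool -K vxlan.calico tx-checksum-ip-generic off
fi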

manuelbuil commented 3 years ago

For testers:

  1. Deploy rke2 on Ubuntu or SLE 15 SP3 with one control-plane and one worker (or more), using Calico as the CNI
  2. Run dig @10.43.0.10 www.google.com on all nodes (see the sketch below). It should work.
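
A small loop can run the check from every node (a sketch; the node list and SSH access are assumptions about the test environment):

# Hypothetical node names -- replace with the real ones:
NODES="server-1 worker-1 worker-2"
for n in $NODES; do
  echo "== $n =="
  ssh "$n" dig @10.43.0.10 www.google.com +short
done
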
rancher-max commented 3 years ago

Validated this is working in v1.21.3-rc6+rke2r2

Confirmed using ubuntu 20.04 LTS. The dig command works from all server and agent nodes.

rancher-max commented 3 years ago

I closed the wrong issue. This should be validated using master branch. Reopening and closing https://github.com/rancher/rke2/issues/1586 instead.

aiyengar2 commented 3 years ago

This issue seems to have been fixed in v1.21.3-rc5+rke2r2 according to https://github.com/rancher/rancher/issues/33052#issuecomment-896913523 but a regression was observed in v1.21.3-rc6+rke2r2, described in https://github.com/rancher/rancher/issues/33052#issuecomment-902132070

vadorovsky commented 3 years ago

Seems like checking for the 5.7 kernel version is not a sufficient fix. We observed the same issue with UDP checksum offloading on:

It is possible to override Calico's feature detection and force this check to a fixed value:

featureDetectOverride: "ChecksumOffloadBroken=true"

I think we should do that by default in rke2 until we find the proper kernel fix.
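
Until that is the default, one way to apply the override by hand on a running cluster might be (a sketch; the resource name and group match the felixconfiguration shown later in this thread):

kubectl patch felixconfigurations.crd.projectcalico.org default --type merge \
  -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'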

On the kernel side, I found the following post-5.7 commits which might be worth trying out:

https://github.com/torvalds/linux/commit/0ea460474d70d809eac0640c1cf408ec54e23966
https://github.com/torvalds/linux/commit/527beb8ef9c02c11f8ca0d59fc46f7d081db1c33
https://github.com/torvalds/linux/commit/600af7fd809ad2a307b6c53b2a3e45453a75cdb6
https://github.com/torvalds/linux/commit/4eb5d4a5b4d64bb9495141b2f323caf7524ef8a6
https://github.com/torvalds/linux/commit/8d6bca156e47d68551750a384b3ff49384c67be3
https://github.com/torvalds/linux/commit/000ac44da7d0adfc5e62e6c019246a4afeeffd04

rancher-max commented 3 years ago

Reopening for testing in rke2

manuelbuil commented 3 years ago

@rancher-max apart from running dig @10.43.0.10 www.google.com on all nodes, verify that kubectl get felixconfigurations.crd.projectcalico.org default -o yaml gives you this spec:

spec:
  bpfLogLevel: ""
  featureDetectOverride: ChecksumOffloadBroken=true
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

We are only passing featureDetectOverride: ChecksumOffloadBroken=true; the rest of the parameters should be filled in by the operator.
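
To check just the field that rke2 sets, something like this should be enough (a sketch):

kubectl get felixconfigurations.crd.projectcalico.org default \
  -o jsonpath='{.spec.featureDetectOverride}'
# Expected output: ChecksumOffloadBroken=true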

rancher-max commented 3 years ago

Leaving this open to validate on 1.22 release line, but confirmed working in v1.21.3-rc7+rke2r2

Validated the dig command works on all nodes, and felixconfigurations are set as mentioned above. Also confirmed running sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic on all nodes returns expected tx-checksum-ip-generic: off.

galal-hussein commented 2 years ago

Validated on master commit 09bb5c27f29cf3dd831653543d274c948a225385

; <<>> DiG 9.16.1-Ubuntu <<>> @10.43.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5294
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 6c2344a672e1a9de (echoed)
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             30      IN      A       142.250.69.206

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Sep 22 22:35:47 UTC 2021
;; MSG SIZE  rcvd: 77

- Make sure that Calico is configured correctly:

kubectl get felixconfigurations.crd.projectcalico.org default -o yaml

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: rke2-calico
    meta.helm.sh/release-namespace: kube-system
    projectcalico.org/metadata: '{"uid":"9f051b21-813b-475b-9615-c23692d89279","generation":1,"creationTimestamp":"2021-09-22T22:32:15Z","managedFields":[{"manager":"helm","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2021-09-22T22:32:15Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}},"f:spec":{".":{},"f:featureDetectOverride":{}}}}]}'
  creationTimestamp: "2021-09-22T22:32:15Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
  resourceVersion: "968"
  uid: 9f051b21-813b-475b-9615-c23692d89279
spec:
  bpfLogLevel: ""
  featureDetectOverride: ChecksumOffloadBroken=true
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

strelok899 commented 7 months ago

I have the opposite issue. The fix has made it into the Helm chart, and I am trying to disable it so I can keep hardware offload, since I have limited resources and my kube-proxy is crashing due to lack of CPU.

How can I make the offloading work?

brandond commented 7 months ago

Enabling hardware offload will not address issues with insufficient CPU resources. Also, please don't revive old resolved issues to ask unrelated questions; open a new issue or discussion instead.