Slow intra-network speed with TCP

danielheimburg commented 5 months ago

Bug Report

We are having issues with several nodes on the same LAN. We use PVE as the hypervisor and deploy Talos onto it. The network runs at MTU 9000 and works fine: hypervisor to hypervisor, we get the full 25 Gbit/s. Talos to Talos on the same hypervisor is also fast, but as soon as we spread the Talos nodes across different PVE nodes (hypervisors), pod-to-pod traffic appears to be heavily fragmented and bandwidth tests reach only 40 Mbit/s on 25 Gbit/s cards. This affects only TCP; UDP gets the full speed. We therefore suspected that the MTU was not being applied correctly and began experimenting.

We found one workaround: apply a network configuration by running talosctl meta write 0xa "$(cat net.yaml)" -n node1 (and the same for node2) and then, without rebooting Talos, kill two pods on each Talos node so that they are redeployed. After that, TCP speed is suddenly back up to 10 Gbit/s or more.

What makes this even more confusing is that the workaround is only temporary. It resolves the problem between two pods, but after restarting one of the involved nodes (or all nodes) the problem is back, this time with the higher-than-before MTU setting from the net.yaml config.
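To see what actually changes across a reboot, it may help to record every interface MTU on a node before and after and compare the two snapshots. The sketch below reads the standard Linux sysfs files; since Talos has no shell, it assumes the script runs somewhere that can see the node's /sys/class/net (for example a privileged debug pod on the host network, which is an assumption about the setup). The function names are illustrative only.

```python
from pathlib import Path


def read_interface_mtus(sys_net: str = "/sys/class/net") -> dict:
    """Return {interface name: MTU} for every interface found under sys_net."""
    mtus = {}
    for iface in sorted(Path(sys_net).iterdir()):
        mtu_file = iface / "mtu"
        # Skip entries that are not network devices (no mtu attribute).
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def changed_mtus(before: dict, after: dict) -> dict:
    """Return {interface: (old MTU, new MTU)} for interfaces that differ.

    An interface missing from one snapshot is reported with None on that side.
    """
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


if __name__ == "__main__":
    for name, mtu in read_interface_mtus().items():
        print(f"{name}: {mtu}")
```

Saving the output once while the workaround is active and once after the reboot, then feeding both snapshots to changed_mtus, would show whether the physical link, the overlay device, or the pod-side interfaces ended up with a different value.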

Thus my own conclusion is that the pods need one MTU and perhaps kube-flannel, or whatever sits behind the pod network, needs another MTU setting. I don't know. To make matters worse, we have tried the exact same setup on different, slower hardware, and it works fine there.
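The "two MTUs" idea can be made concrete with a little arithmetic. If Flannel is running with its VXLAN backend (an assumption; the thread does not say which backend is in use), every pod packet is wrapped in extra outer headers, so the pod-side MTU has to be smaller than the node-side MTU by that overhead. The sketch below does the calculation for IPv4; the function names are illustrative only.

```python
# Header sizes in bytes for VXLAN over IPv4 (assumed encapsulation).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14
VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET  # 50


def overlay_mtu(node_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """Largest pod-side MTU whose encapsulated packet fits the node MTU."""
    if node_mtu <= overhead:
        raise ValueError("node MTU is smaller than the encapsulation overhead")
    return node_mtu - overhead


def tcp_mss(mtu: int) -> int:
    """TCP payload per segment for IPv4 with no IP or TCP options."""
    return mtu - 20 - 20


if __name__ == "__main__":
    for node_mtu in (1500, 9000):
        pod_mtu = overlay_mtu(node_mtu)
        print(f"node MTU {node_mtu}: pod MTU {pod_mtu}, TCP MSS {tcp_mss(pod_mtu)}")
```

Under these assumptions a 9000-byte node MTU leaves 8950 bytes for the pod side. If the pod interfaces were configured larger than that, encapsulated packets would no longer fit in a single frame on the physical link.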

We are two senior system engineers who have been debugging this for a week now, so we have tried most of the obvious solutions.

Description

Slow network speeds between pods on the same LAN with Talos Linux Kubernetes.

Logs

I can provide iperf3 logs if needed but I don't see the point. Any other logs I will happily attach upon request.
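If the iperf3 runs are captured with the -J flag (JSON output), a small helper can reduce each run to the numbers that matter for this report: the receiver-side throughput and, for TCP, the sender's retransmit count. This is a sketch, assuming the usual layout of iperf3 JSON reports (an "end" object containing "sum_received" and "sum_sent" for TCP, or "sum" for UDP); the helper name is hypothetical.

```python
import json
import sys


def summarize_iperf3(report_json: str) -> dict:
    """Extract throughput (Gbit/s) and TCP retransmits from an iperf3 -J report."""
    end = json.loads(report_json)["end"]
    # Prefer the receiver-side summary; fall back to the combined "sum"
    # section that UDP reports provide.
    received = end.get("sum_received") or end.get("sum")
    if received is None:
        raise ValueError("report has no end.sum_received or end.sum section")
    summary = {"gbit_per_s": received["bits_per_second"] / 1e9}
    # Retransmits are only reported by the TCP sender.
    sent = end.get("sum_sent") or {}
    if "retransmits" in sent:
        summary["retransmits"] = sent["retransmits"]
    return summary


if __name__ == "__main__":
    # Usage: python summarize.py tcp.json udp.json
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, summarize_iperf3(fh.read()))
```

Comparing the TCP and UDP summaries for the same pair of pods side by side would show both the throughput gap and whether the slow TCP runs come with a high retransmit count.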

Environment

Talos version:
  NODE: 10.13.20.75
  Tag: v1.6.2
  SHA: 26eee755
  Built:
  Go version: go1.21.6 X:loopvar
  OS/Arch: linux/amd64
  Enabled: RBAC
  NODE: 10.13.20.76
  Tag: v1.6.2
  SHA: 26eee755
  Built:
  Go version: go1.21.6 X:loopvar
  OS/Arch: linux/amd64
  Enabled: RBAC
Kubernetes version:
  Client Version: v1.28.2
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.28.2
Platform: PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
Hardware network cards: Broadcom BCM57414 25Gbit
Hardware switch software: Mikrotik RouterOS 7.8

andrewrynhard commented 5 months ago

Hi @danielheimburg. I am curious. Have you tried another CNI yet? If not, might I suggest you give Cilium a try?

For two reasons:

We can rule out anything specific to Flannel
We are interested in adopting Cilium more officially. Curious to get your take there.

danielheimburg commented 5 months ago

Hi @andrewrynhard,

After your suggestion to try another CNI, we created a cluster with Cilium and it works! So the issue is most likely Flannel combined with our hardware, which seemed like a long shot to me, but evidently that's the issue.

I can also report that initial testing shows the performance of Cilium to be better than Flannel. We will deploy our services onto this cluster and test some more, but I'm very positive about Cilium, which I haven't tried before.

Would you consider this to be a Flannel or Linux bug rather than a Talos bug? Can I help with some more information?

siderolabs / talos