siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Slow intra-network speed with TCP #8215

Open danielheimburg opened 5 months ago

danielheimburg commented 5 months ago

Bug Report

We are having issues with several nodes on the same LAN. We use PVE as the hypervisor and deploy Talos onto it. The network runs at MTU 9000 and works fine: hypervisor to hypervisor, we get the full 25 Gbit/s. Talos to Talos on the same hypervisor is also fast, but as soon as we spread the Talos nodes across different PVE nodes (hypervisors), pod-to-pod traffic becomes extremely fragmented, with bandwidth tests reaching only 40 Mbit/s on 25 Gbit/s cards. This affects only TCP; UDP gets the full speed. We therefore suspected that the MTU is not being applied correctly and began experimenting.
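One way to compare the MTU that each interface actually ended up with is to read it from sysfs on every node. The following is a minimal sketch, assuming it is run somewhere with access to the node's network namespace (for example a host-network debug pod); the function name read_mtus is made up for illustration and is not part of any Talos tooling.

```python
from pathlib import Path


def read_mtus(sys_net: Path = Path("/sys/class/net")) -> dict[str, int]:
    """Return {interface name: MTU} for every interface listed under sys_net."""
    mtus: dict[str, int] = {}
    if not sys_net.is_dir():
        return mtus  # not a Linux host, or sysfs is not mounted
    for entry in sorted(sys_net.iterdir()):
        mtu_file = entry / "mtu"
        # Skip entries that are not interfaces (e.g. the bonding_masters file).
        if mtu_file.is_file():
            mtus[entry.name] = int(mtu_file.read_text().strip())
    return mtus


if __name__ == "__main__":
    for name, mtu in read_mtus().items():
        print(f"{name}: {mtu}")
```

Running this on two nodes that sit on different hypervisors and comparing the output would show whether the physical NIC, the overlay interface, and the pod-side interfaces agree on an MTU.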

We found one workaround: applying a network configuration with talosctl meta write 0xa "$(cat net.yaml)" -n node1 (and the same for node2) and then, without rebooting Talos, killing two pods on each Talos node so that they are redeployed. After that, TCP speed is suddenly up to par at 10 Gbit/s or more.

What makes this even more confusing is that this resolves the problem only temporarily between two pods: after restarting one of the involved nodes (or all nodes), the problem is back, this time with a higher MTU setting than before, taken from the net.yaml config.

Thus my own conclusion is that the pods need one MTU and perhaps kube-flannel, or whatever is behind the network, needs another MTU setting. I don't know. What makes matters worse is that we have tried the exact same setup on different, slower hardware, and it works fine there.
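To make the "two different MTUs" idea concrete, here is a rough sketch of the arithmetic, assuming Flannel is using its VXLAN backend over IPv4 (an assumption; the thread does not state the backend). VXLAN wraps each pod packet in an outer IPv4, UDP, and VXLAN header plus the inner Ethernet header, so the pod-facing MTU has to be smaller than the underlay MTU. The helper names below are illustrative only.

```python
# Assumption: VXLAN encapsulation over an IPv4 underlay.
# outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet header (14)
VXLAN_OVERHEAD_IPV4 = 20 + 8 + 8 + 14  # 50 bytes


def overlay_mtu(underlay_mtu: int, overhead: int = VXLAN_OVERHEAD_IPV4) -> int:
    """Largest pod-side MTU whose encapsulated packets still fit the underlay."""
    if underlay_mtu <= overhead:
        raise ValueError("underlay MTU is smaller than the encapsulation overhead")
    return underlay_mtu - overhead


def fits_underlay(pod_mtu: int, underlay_mtu: int,
                  overhead: int = VXLAN_OVERHEAD_IPV4) -> bool:
    """True if a full-size pod packet can be encapsulated without fragmentation."""
    return pod_mtu + overhead <= underlay_mtu


def tcp_mss(mtu: int) -> int:
    """TCP payload per segment for IPv4 without IP or TCP options."""
    return mtu - 20 - 20


if __name__ == "__main__":
    print(overlay_mtu(9000))          # pod MTU that fits a 9000-byte underlay
    print(fits_underlay(9000, 9000))  # pod MTU equal to underlay MTU does not fit
    print(tcp_mss(overlay_mtu(9000)))
```

Under these assumptions a 9000-byte underlay leaves 8950 bytes for the pod interface; a pod interface that is also set to 9000 would produce encapsulated packets that exceed the physical MTU, and full-size TCP segments are the ones affected.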

We are two senior system engineers who have been debugging this for a week now, so we have tried the most obvious solutions.

Description

Slow network speeds between pods on the same LAN with Kubernetes on Talos Linux

Logs

I can provide iperf3 logs if needed but I don't see the point. Any other logs I will happily attach upon request.

Environment

andrewrynhard commented 5 months ago

Hi @danielheimburg. I am curious. Have you tried another CNI yet? If not, might I suggest you give Cilium a try?

For two reasons:

  1. We can rule out anything specific to Flannel
  2. We are interested in adopting Cilium more officially. Curious to get your take there.

danielheimburg commented 5 months ago

Hi @andrewrynhard,

After your suggestion to try another CNI, we created a cluster with Cilium and it works! So the issue is most likely Flannel combined with our hardware, which seemed like a long shot to me, but evidently that's the issue.

I can also report that initial testing shows the performance of Cilium to be better than Flannel. We will deploy our services onto this cluster and test some more, but I'm very positive about Cilium, which I haven't tried before.

Would you consider this to be a Flannel or Linux bug rather than a Talos bug? Can I help by providing more information?