okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Pod to Pod Communication severely degraded in 4.11 on vSphere #1550

Closed MattPOlson closed 1 year ago

MattPOlson commented 1 year ago

Describe the bug

We run okd in a vSphere environment with the below configuration:

vSphere:
ESXi version: 7.0 U3e
Separate vDS (on version 6.5) for front end and iSCSI

Hardware:
UCS B200-M4 Blade
    BIOS - B200M4.4.1.2a.0.0202211902
    Xeon(R) CPU E5-2667
    2 x 20Gb Cisco UCS VIC 1340 network adapter for front end connectivity (Firmware 4.5(1a))
    2 x 20Gb Cisco UCS VIC 1340 network adapter for iSCSI connectivity (Firmware 4.5(1a))

Storage:
Compellent SC4020 over iSCSI
    2 controller array with dual iSCSI IP connectivity (2 paths per LUN)
All cluster nodes on same Datastore

After upgrading the cluster from a 4.10.x version to any 4.11.x or higher version, pod-to-pod communication is severely degraded when the pods run on nodes hosted on different ESXi hosts. We ran a benchmark test on the cluster before the upgrade with the below results:

Benchmark Results

Name : knb-2672
Date : 2023-03-29 15:26:01 UTC
Generator : knb
Version : 1.5.0
Server : k8s-se-internal-01-582st-worker-n2wtp
Client : k8s-se-internal-01-582st-worker-cv7cd
UDP Socket size : auto

Discovered CPU : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Discovered Kernel : 5.18.5-100.fc35.x86_64
Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
Discovered MTU : 1400
Idle :
bandwidth = 0 Mbit/s
client cpu = total 12.31% (user 9.41%, nice 0.00%, system 2.83%, iowait 0.07%, steal 0.00%)
server cpu = total 9.04% (user 6.28%, nice 0.00%, system 2.74%, iowait 0.02%, steal 0.00%)
client ram = 4440 MB
server ram = 3828 MB
Pod to pod :
TCP :
bandwidth = 6306 Mbit/s
client cpu = total 26.15% (user 5.19%, nice 0.00%, system 20.96%, iowait 0.00%, steal 0.00%)
server cpu = total 29.39% (user 8.13%, nice 0.00%, system 21.26%, iowait 0.00%, steal 0.00%)
client ram = 4460 MB
server ram = 3820 MB
UDP :
bandwidth = 1424 Mbit/s
client cpu = total 26.08% (user 7.21%, nice 0.00%, system 18.82%, iowait 0.05%, steal 0.00%)
server cpu = total 24.82% (user 6.72%, nice 0.00%, system 18.05%, iowait 0.05%, steal 0.00%)
client ram = 4444 MB
server ram = 3824 MB
Pod to Service :
TCP :
bandwidth = 6227 Mbit/s
client cpu = total 27.90% (user 5.12%, nice 0.00%, system 22.73%, iowait 0.05%, steal 0.00%)
server cpu = total 29.85% (user 5.86%, nice 0.00%, system 23.99%, iowait 0.00%, steal 0.00%)
client ram = 4439 MB
server ram = 3811 MB
UDP :
bandwidth = 1576 Mbit/s
client cpu = total 32.31% (user 6.41%, nice 0.00%, system 25.90%, iowait 0.00%, steal 0.00%)
server cpu = total 26.12% (user 5.68%, nice 0.00%, system 20.39%, iowait 0.05%, steal 0.00%)
client ram = 4449 MB
server ram = 3818 MB
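
For context, a run like the one above is typically invoked along these lines with knb from k8s-bench-suite (node names taken from the results above; the exact flags are an assumption based on the knb README):

./knb --verbose \
    --client-node k8s-se-internal-01-582st-worker-cv7cd \
    --server-node k8s-se-internal-01-582st-worker-n2wtp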

After upgrading to version 4.11.0-0.okd-2023-01-14-152430, the latency between the pods is so high that the benchmark, qperf, and iperf tests all time out and fail to run. This is the result of curling the network-check pod across nodes; it takes close to 30 seconds.

sh-4.4# time curl http://10.129.2.44:8080
Hello, 10.128.2.2. You have reached 10.129.2.44 on k8s-se-internal-01-582st-worker-cv7cd
real    0m26.496s

We have been able to reproduce this issue consistently on multiple different clusters.

Version
4.11.0-0.okd-2023-01-14-152430 IPI on vSphere

How reproducible
Upgrade to or install a 4.11.x or higher version of OKD and observe the latency.
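
A minimal way to observe the degradation directly is to pin an iperf3 server and client to workers on different ESXi hosts. This is a hedged sketch: the node names and the networkstatic/iperf3 image are placeholders, not from the report.

# Start an iperf3 server pinned to one worker (node name is a placeholder)
oc run iperf-server --image=networkstatic/iperf3 --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-a"}}' \
    --command -- iperf3 -s
SERVER_IP=$(oc get pod iperf-server -o jsonpath='{.status.podIP}')
# Run the client on a worker hosted on a different ESXi host
oc run iperf-client --image=networkstatic/iperf3 --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-b"}}' \
    --command -- iperf3 -c "$SERVER_IP" -t 10
oc logs -f iperf-client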

MattPOlson commented 1 year ago

I had tried systemctl restart NetworkManager on one node after your message, without thinking it through. This breaks the SSH connection and kills the command, probably because of the missing parent process. I had to reset the node manually. I have not found a way to open any kind of tmux or screen session on Fedora CoreOS.

I can confirm that the offload parameters are also set in my environment.

[root@worker1-cl1-dc3 ~]# ethtool -k ens192 | grep tx-udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: off [fixed]

I ran network performance tests using iperf before and after changing the offload parameters. I used ethtool to change them:

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

The difference between the tests is tiny. The gap in network speed between two pods on different nodes and between two VMs is very large (between two VMs the speed is around 7x faster), but as far as I can tell this is due to OVN. I did not notice any network disconnections.
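
Note that ethtool -K settings do not survive a reboot. To keep the offloads disabled persistently, something like the following NetworkManager dispatcher script could be used (the script path, file name, and interface name are assumptions; on OKD a change like this would normally be rolled out via a MachineConfig):

cat <<'EOF' > /etc/NetworkManager/dispatcher.d/99-disable-offload
#!/bin/bash
# Re-apply the offload settings whenever the uplink comes up
# (interface name ens192 is assumed from the ethtool output above)
if [ "$1" = "ens192" ] && [ "$2" = "up" ]; then
    ethtool -K ens192 tx-udp_tnl-segmentation off
    ethtool -K ens192 tx-udp_tnl-csum-segmentation off
fi
EOF
chmod +x /etc/NetworkManager/dispatcher.d/99-disable-offload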

Running these commands fixes the issues for us:

restorecon -vR /etc/NetworkManager/dispatcher.d/
semodule -B
systemctl restart NetworkManager
systemctl restart kubelet
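
To confirm that mislabeled dispatcher scripts were the culprit, the SELinux labels and any recent AVC denials can be inspected before and after running the fix (a sketch; the expected label values vary by policy version):

# Show SELinux contexts on the dispatcher scripts
ls -Z /etc/NetworkManager/dispatcher.d/
# Look for recent SELinux denials involving NetworkManager
ausearch -m avc -ts recent | grep -i networkmanager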

With offload on, communication between pods on different nodes is really bad. I have also found that if we upgrade the vSphere Distributed Switch to version 7.0.3, the problem goes away: speeds are normal with offload on.
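
For anyone who wants to verify the vDS version from the CLI, this is a sketch using govc (connection details come from GOVC_URL and related environment variables; the inventory path is a placeholder):

# List distributed virtual switches ('w' is govc's type code for a DVS)
govc find / -type w
# Read the product version of a given switch (path is a placeholder)
govc object.collect -s /dc1/network/frontend-vds config.productInfo.version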