Closed MattPOlson closed 1 year ago
I had tried
systemctl restart NetworkManager
on one node after your message without thinking further. This broke the SSH connection and killed the command, probably because its parent process went away. I had to reset the node manually. I have not found any way to open a tmux or screen session in Fedora CoreOS.
I can confirm that the offload parameters are also set in my environment.
[root@worker1-cl1-dc3 ~]# ethtool -k ens192 | grep tx-udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: off [fixed]
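For checking several nodes, the same inspection could be scripted. The following is only a sketch: the function name `list_enabled_udp_offloads` is made up for illustration, and it assumes the `feature: state` line format shown in the output above.

```shell
# list_enabled_udp_offloads (hypothetical helper name):
# reads `ethtool -k <iface>` output on stdin and prints the name of every
# tx-udp* feature whose state starts with "on".
list_enabled_udp_offloads() {
    awk -F': *' '
        /^[[:space:]]*tx-udp/ && $2 ~ /^on/ {
            gsub(/^[[:space:]]+/, "", $1)   # drop any leading indentation
            print $1
        }'
}
```

Usage on a node would look like `ethtool -k ens192 | list_enabled_udp_offloads`; an empty result would mean none of the UDP transmit offloads is enabled on that interface.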
I ran network performance tests using iperf before and after changing the offload parameters. I used ethtool to change the offload parameters:
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
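To make the before/after comparison easier to repeat, the two ethtool calls could be wrapped in a small helper. This is a sketch, not something from the thread: the function name and the `DRY_RUN` variable are assumptions introduced here. Note also that settings changed with `ethtool -K` are runtime settings and are generally not kept across a reboot.

```shell
# set_udp_tnl_offload (hypothetical helper):
#   $1: interface name, e.g. ens192
#   $2: desired state, "on" or "off"
# With DRY_RUN=1 (an assumption of this sketch) the commands are only printed.
set_udp_tnl_offload() {
    case "$2" in
        on|off) ;;
        *) echo "state must be 'on' or 'off'" >&2; return 2 ;;
    esac
    for _feature in tx-udp_tnl-segmentation tx-udp_tnl-csum-segmentation; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ethtool -K $1 $_feature $2"
        else
            ethtool -K "$1" "$_feature" "$2" || return 1
        fi
    done
}
```

For example, `set_udp_tnl_offload ens192 off` before one iperf run and `set_udp_tnl_offload ens192 on` before the next keeps the two test conditions explicit.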
The difference between the tests is tiny. The gap in network speed between two pods on different nodes and two VMs is very large (between two VMs the speed is around 7x faster), but as far as I know this is due to OVN. I did not notice any network disconnections.
Running this command fixes the issues for us:
restorecon -vR /etc/NetworkManager/dispatcher.d/;semodule -B;systemctl restart NetworkManager;systemctl restart kubelet
With offload on, communication between pods on different nodes is really bad. I have also found that if we upgrade the vSphere Distributed Switch to version 7.0.3, the problem goes away and speeds are normal with offload on.
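Since restarting NetworkManager over SSH was reported earlier in this thread to drop the session and kill the command, one option is to hand the fix sequence to systemd as a transient unit so it does not depend on the SSH session. A minimal sketch follows; the helper name, the unit name passed to it, and the `DRY_RUN` switch are assumptions made for illustration.

```shell
# The fix sequence quoted in this thread, kept as a single shell command line.
FIX_CMD='restorecon -vR /etc/NetworkManager/dispatcher.d/; semodule -B; systemctl restart NetworkManager; systemctl restart kubelet'

# run_fix_detached (hypothetical helper):
#   $1: name for the transient unit (an arbitrary label, not from the thread)
# With DRY_RUN=1 (an assumption of this sketch) the command is only printed.
run_fix_detached() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf "systemd-run --unit=%s /bin/sh -c '%s'\n" "$1" "$FIX_CMD"
    else
        # systemd-run starts the command as a transient service under the
        # system manager rather than as a child of the login shell.
        systemd-run --unit="$1" /bin/sh -c "$FIX_CMD"
    fi
}
```

After something like `run_fix_detached nm-fix` as root, the output of the run should be readable with `journalctl -u nm-fix` once the node is reachable again.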
Describe the bug
We run OKD in a vSphere environment with the below configuration:
After upgrading the cluster from a 4.10.x version to 4.11.x or higher, pod-to-pod communication is severely degraded when the nodes that the pods run on are hosted on different ESX hosts. We ran a benchmark test on the cluster before the upgrade with the below results:
After upgrading to version 4.11.0-0.okd-2023-01-14-152430, the latency between the pods is so high that the benchmark test, qperf test, and iperf test all time out and fail to run. This is the result of curling the network check pod across nodes; it takes close to 30 seconds.
We have been able to reproduce this issue consistently on multiple different clusters.
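For anyone trying to reproduce the cross-node curl measurement described above, the request time can be captured with curl's `--write-out` option. This is a sketch only: the helper name is made up, and the URL of the network check pod is not given in the report, so it has to be supplied by the caller.

```shell
# request_seconds (hypothetical helper):
#   $1: URL of the target pod or service (placeholder; not given in the report)
# Prints the total request time in seconds as reported by curl.
request_seconds() {
    curl --silent --output /dev/null --max-time 60 \
         --write-out '%{time_total}\n' "$1"
}
```

Running it from a pod against a target on the same node and then against a target on a different node gives two numbers that can be compared directly.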
Version
4.11.0-0.okd-2023-01-14-152430, IPI on vSphere
How reproducible
Upgrade to or install OKD 4.11.x or higher and observe the latency.