What is the VMware hardware version of the VMs?
They are: ESXi 6.7 U2 and later (VM version 15)
Is it reproducible in 4.12?
Yes, we upgraded a cluster to 4.12 and were able to reproduce it.
Right, so it's possible the kernel module or OVN has regressed. Could you check whether node-to-node performance has degraded too? If yes, it's probably a Fedora / kernel regression.
Node-to-node performance is good; I tested on the nodes themselves using the toolbox:
[ 1] local 10.33.154.189 port 32934 connected with 10.33.154.187 port 5001 (icwnd/mss/irtt=14/1448/241)
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.01 sec 8.01 GBytes 6.87 Gbits/sec
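For reference, the node-to-node check was a plain iperf run from the FCOS toolbox; a minimal way to repeat it (a sketch, installing iperf inside the toolbox container and reusing the node IPs from the output above):
# On the first node (10.33.154.187): enter the toolbox and start an iperf server
toolbox
dnf install -y iperf
iperf -s -p 5001
# On the second node: enter the toolbox and run the client against the first node
toolbox
dnf install -y iperf
iperf -c 10.33.154.187 -p 5001 -t 10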
In that case it's probably OVN. I wonder if we could confirm it's OKD-specific?
We have a few OCP clusters at 4.11 and I haven't been able to reproduce the problem in them.
When you jump from 4.11 (4.11.0-0.okd-2022-08-20-022919 or so) to 4.12.0-0.okd-2023-02-18-033438 or newer you should see better network performance.
That is not the case for us; I have upgraded to version 4.12.0-0.okd-2023-04-01-051724 and am still seeing the same issues. I still can't run any test across pods without it timing out.
Do we need to open an issue in the ovn-kubernetes repository? How do I figure out what release of ovn-kubernetes is in a specific version of OKD?
I think you should open an issue in https://github.com/openshift/ovn-kubernetes/. Depending on the OKD version you are using, select the corresponding branch in that repository to see the current source code of ovn-kubernetes.
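For reference, one way to see which ovn-kubernetes commit ships in a given OKD release without digging through branches (a sketch, assuming oc is logged in to the cluster):
# Resolve the release image the cluster is running
RELEASE_IMAGE=$(oc get clusterversion version -o jsonpath='{.status.desired.image}')
# List the source commits of every component in that release and filter for ovn-kubernetes
oc adm release info --commits "${RELEASE_IMAGE}" | grep ovn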
We re-deployed the cluster with version 4.10.0-0.okd-2022-07-09-073606 on the same hardware and the issue went away. There is clearly an issue with 4.11 and above. Benchmark results are below:
=========================================================
Benchmark Results
=========================================================
Name : knb-17886
Date : 2023-04-10 19:46:01 UTC
Generator : knb
Version : 1.5.0
Server : k8s-se-platform-01-t4fb6-worker-vw2d9
Client : k8s-se-platform-01-t4fb6-worker-jk2wm
UDP Socket size : auto
=========================================================
Discovered CPU : Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz
Discovered Kernel : 5.18.5-100.fc35.x86_64
Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
Discovered MTU : 1400
Idle :
bandwidth = 0 Mbit/s
client cpu = total 4.06% (user 2.17%, nice 0.00%, system 1.82%, iowait 0.07%, steal 0.00%)
server cpu = total 2.96% (user 1.48%, nice 0.00%, system 1.48%, iowait 0.00%, steal 0.00%)
client ram = 925 MB
server ram = 1198 MB
Pod to pod :
TCP :
bandwidth = 8348 Mbit/s
client cpu = total 26.07% (user 1.78%, nice 0.00%, system 24.27%, iowait 0.02%, steal 0.00%)
server cpu = total 26.59% (user 1.94%, nice 0.00%, system 24.63%, iowait 0.02%, steal 0.00%)
client ram = 930 MB
server ram = 1196 MB
UDP :
bandwidth = 1666 Mbit/s
client cpu = total 19.21% (user 2.14%, nice 0.00%, system 17.02%, iowait 0.05%, steal 0.00%)
server cpu = total 22.51% (user 2.91%, nice 0.00%, system 19.55%, iowait 0.05%, steal 0.00%)
client ram = 924 MB
server ram = 1201 MB
Pod to Service :
TCP :
bandwidth = 8274 Mbit/s
client cpu = total 26.55% (user 1.78%, nice 0.00%, system 24.77%, iowait 0.00%, steal 0.00%)
server cpu = total 26.37% (user 2.67%, nice 0.00%, system 23.68%, iowait 0.02%, steal 0.00%)
client ram = 922 MB
server ram = 1191 MB
UDP :
bandwidth = 1635 Mbit/s
client cpu = total 20.19% (user 1.60%, nice 0.00%, system 18.54%, iowait 0.05%, steal 0.00%)
server cpu = total 21.80% (user 2.82%, nice 0.00%, system 18.98%, iowait 0.00%, steal 0.00%)
client ram = 913 MB
server ram = 1179 MB
=========================================================
=========================================================
qperf
======================================================
/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
bw = 907 MB/sec
tcp_lat:
latency = 70.6 us
/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
bw = 1 GB/sec
tcp_lat:
latency = 68.2 us
===
I tested this on a cluster using OpenShiftSDN: deployed version 4.10, upgraded to 4.11, and was able to replicate the issue. So it's not specific to OVN.
So issue #1563 for 4.11.0-0.okd-2022-12-02-145640 is reproducible in that case.
Version and build numbers of ESXi, please.
FCOS kernel version?
Have you tried a test where all the OKD nodes are on the same physical ESXi host?
Yes, for https://github.com/okd-project/okd/issues/1563, I checked that all the nodes (master, storage, and worker, except the remote worker nodes) are already on the same ESXi host (ESXi 6.7 and later, VM version 14).
I will try to reproduce here, but it would be good to know if I am replicating what was already provisioned. Again, can I get the ESXi version and build numbers and the FCOS kernel version - please be specific.
Remember vSphere 6.x is EOL, and some older versions have issues with VXLAN with ESXi and kernel drivers.
In our case the ESXi version info is in the initial post.
Linux version 6.0.18-200.fc36.x86_64 (mockbuild@bkernel01.iad2.fedoraproject.org) (gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4), GNU ld version 2.37-37.fc36) #1 SMP PREEMPT_DYNAMIC Sat Jan 7 17:08:48 UTC 2023
vSphere: ESXi version 7.0 U3e. Separate vDS (on version 6.5) for front end and iSCSI.
OKD version: 4.12.0-0.okd-2023-04-01-051724; FCOS kernel version: 6.1.14-200.fc37.x86_64
ESXi: VMware ESXi, 8.0.0, 20513097
client
sh-5.2$ iperf3 -i 5 -t 60 -c 10.129.2.8
Connecting to host 10.129.2.8, port 5201
[ 5] local 10.128.2.18 port 48400 connected to 10.129.2.8 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-5.00 sec 3.88 GBytes 6.67 Gbits/sec 301 1.19 MBytes
[ 5] 5.00-10.00 sec 3.77 GBytes 6.49 Gbits/sec 630 1.12 MBytes
[ 5] 10.00-15.00 sec 3.93 GBytes 6.75 Gbits/sec 92 1.67 MBytes
[ 5] 15.00-20.00 sec 3.83 GBytes 6.58 Gbits/sec 400 1.06 MBytes
[ 5] 20.00-25.00 sec 3.22 GBytes 5.54 Gbits/sec 5329 1.02 MBytes
[ 5] 25.00-30.00 sec 3.41 GBytes 5.85 Gbits/sec 184 1.45 MBytes
^C[ 5] 30.00-34.66 sec 3.21 GBytes 5.92 Gbits/sec 874 1.20 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-34.66 sec 25.3 GBytes 6.26 Gbits/sec 7810 sender
[ 5] 0.00-34.66 sec 0.00 Bytes 0.00 bits/sec receiver
iperf3: interrupt - the client has terminated
sh-5.2$
server
Accepted connection from 10.128.2.18, port 48394
[ 5] local 10.129.2.8 port 5201 connected to 10.128.2.18 port 48400
[ ID] Interval Transfer Bitrate
[ 5] 0.00-5.00 sec 3.88 GBytes 6.67 Gbits/sec
[ 5] 5.00-10.00 sec 3.78 GBytes 6.49 Gbits/sec
[ 5] 10.00-15.00 sec 3.93 GBytes 6.74 Gbits/sec
[ 5] 15.00-20.00 sec 3.83 GBytes 6.58 Gbits/sec
[ 5] 20.00-25.00 sec 3.22 GBytes 5.54 Gbits/sec
[ 5] 25.00-30.00 sec 3.41 GBytes 5.85 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-30.00 sec 25.3 GBytes 7.23 Gbits/sec receiver
iperf3: the client has terminated
-----------------------------------------------------------
Server listening on 5201 (test #3)
-----------------------------------------------------------
Not really seeing a problem here. Each pod in the above test was on a different FCOS node, residing on different physical ESXi hosts.
I can't even get iperf tests to run when the pods are on nodes that sit on different ESXi hosts; they just time out.
In my case I can't even get the console up anymore. I've reproduced it over and over, and now it looks like someone else has as well. Not sure what to do other than stay at 4.10.
@MattPOlson based on your previous comments this looks to me like MTU or something with VXLAN. Have you checked all the virtual switches and the physical device MTU?
And is it correct that when all the guests reside together there is no performance issue? Is there a specific ESXi host that is OK?
We've tried MTU settings and set them to match the host. But why would that affect 4.11 and not 4.10? I can spin up a cluster on 4.10 and it works perfectly; upgrade it to 4.11 and change nothing else, and it breaks.
And yes, if all the nodes reside together there is no issue.
Don't you find that odd? If the problem occurs when packets are leaving the ESXi host, then I would suspect something physical. I can't comment on why the version would make a difference, but I can't reproduce it.
Agreed, but I also find it odd that upgrading to 4.11 breaks it and someone else was able to reproduce it. To me that feels like it's not something specific to our environment.
We do test OCP and OKD in multiple different vSphere environments and haven't seen this issue. Maybe you and @imdmahajankanika stumbled into the same problem?
The question is what the commonality is.
Right, that is the question.
Interestingly, we have a few OCP clusters running at 4.11 on the exact same hardware and don't see the issue there.
After upgrading our clusters from 4.10.0-0.okd-2022-07-09-073606 to 4.11.0-0.okd-2023-01-14-152430, connectivity between kube-apiserver and all other apiservers was lost. Our master nodes all run on vSphere. We could fix the issue by running:
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
on the master nodes. I have a feeling that we are looking at the very old bug: https://github.com/openshift/machine-config-operator/pull/2482
Could you check the state of tunnel offloading on your nodes with ethtool -k <your-primary-interface> | grep tx-udp?
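To collect that from every node without SSH, something like the following should do (a sketch; it assumes the primary interface is ens192 on all nodes):
for node in $(oc get nodes -o name); do
  echo "== ${node} =="
  # chroot into the host so ethtool sees the real interfaces
  oc debug "${node}" -- chroot /host ethtool -k ens192 | grep tx-udp
done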
That shouldn't be an issue, as we haven't removed the workaround.
And it only affects specific older ESXi versions; if you are hitting the VXLAN offloading bug you need to upgrade your hosts.
I think I figured it out: your workaround isn't working anymore. It looks like there is a permission issue, and NetworkManager-dispatcher.service is failing to run the scripts in /etc/NetworkManager/dispatcher.d, including 99-vsphere-disable-tx-udp-tnl:
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:12 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:13 'up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:14 'up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:15 'pre-up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d/pre-up.d': Error opening directory “/etc/NetworkManager/dispatcher.d/pre-up.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:16 'up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:17 'dhcp4-change' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:18 'pre-up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d/pre-up.d': Error opening directory “/etc/NetworkManager/dispatcher.d/pre-up.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:19 'up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:20 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:58:01 k8s-se-internal-01-582st-master-0 systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Disabling tunnel offloading seems to fix the problem. I'm looking into why that script is now getting permission denied errors.
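For context, the workaround is a NetworkManager dispatcher script; a simplified sketch of what 99-vsphere-disable-tx-udp-tnl amounts to (the script actually shipped on the nodes may differ in details):
#!/bin/bash
# /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl (simplified sketch)
# NetworkManager passes the interface name as $1 and the event as $2.
IFACE="$1"
ACTION="$2"
if [ "${ACTION}" = "up" ]; then
    # Disable VXLAN tunnel offloading on the vmxnet3 NIC to work around the ESXi bug
    ethtool -K "${IFACE}" tx-udp_tnl-segmentation off
    ethtool -K "${IFACE}" tx-udp_tnl-csum-segmentation off
fi
If the dispatcher directory can't be read at all, none of these scripts run, which would explain the offloads staying enabled.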
VMware only updated their release notes for 6.7 to say this issue is resolved; I am unsure which 7.x build fixes it. We are running close to the latest versions of 7 and 8.
We certainly need to figure out the permission issue, which is strange. I would have figured other dispatcher scripts would be breaking too.
It looks like this bug was already reported and fixed, but I'm definitely still seeing the issue in our environment.
Perhaps it's also https://github.com/okd-project/okd/issues/1475?
I think I figured something out: the script that rhcos-selinux-policy-upgrade.service executes to reload SELinux never does anything, because it looks for RHEL_VERSION in /usr/lib/os-release.
That variable exists in Red Hat Enterprise Linux CoreOS but not in Fedora, so the script never reaches the line that calls semodule -B.
#!/bin/bash
# Executed by rhcos-selinux-policy-upgrade.service
set -euo pipefail
RHEL_VERSION=$(. /usr/lib/os-release && echo ${RHEL_VERSION:-})
echo -n "RHEL_VERSION=${RHEL_VERSION:-}"
case "${RHEL_VERSION:-}" in
8.[0-6]) echo "Checking for policy recompilation";;
*) echo "Assuming we have new enough ostree"; exit 0;;
esac
ls -al /{usr/,}etc/selinux/targeted/policy/policy.31
if ! cmp --quiet /{usr/,}etc/selinux/targeted/policy/policy.31; then
echo "Recompiling policy due to local modifications as workaround for https://bugzilla.redhat.com/2057497"
semodule -B
fi
cat /usr/lib/os-release
NAME="Fedora Linux"
VERSION="37.20230303.3.0 (CoreOS)"
ID=fedora
VERSION_ID=37
VERSION_CODENAME=""
PLATFORM_ID="platform:f37"
PRETTY_NAME="Fedora CoreOS 37.20230303.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:37"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=37
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=37
SUPPORT_END=2023-11-14
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='37.20230303.3.0'
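A quick way to confirm the service takes the wildcard branch on FCOS is to check its journal on a node (the expected message comes from the script above):
journalctl -u rhcos-selinux-policy-upgrade.service --no-pager
# On FCOS, RHEL_VERSION is empty, so the case statement should log
# "Assuming we have new enough ostree" and exit before ever calling semodule -B.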
@jcpowermac We are running on ESXi 7u3, where this issue should be fixed. Maybe it has come up again in newer versions of vmxnet3?
@bo0ts I would suggest opening a support request with VMware. They own both aspects of this: the Linux kernel driver [0] and ESXi.
[0] - https://github.com/torvalds/linux/commits/master/drivers/net/vmxnet3
Over in the Slack thread there is also discussion about why and where this occurs. We don't see that problem on our clusters. But perhaps we simply don't do enough intra-cluster communication? Can you give us an easy test so I can verify whether we indeed don't have the problem or just don't see it?
The easiest way is to deploy an iperf client on one node and an iperf server on another node, then run a test between them to check performance.
OK, I guess something like this: https://github.com/InfuseAI/k8s-iperf
I've had good luck with this one:
FROM quay.io/fedora/fedora:38
RUN dnf install -y iperf3 ttcp qperf
ENTRYPOINT trap : TERM INT; sleep infinity & wait # Listen for kill signals and exit quickly.
cat Dockerfile | oc new-build --name perf -D -
Then I created a deployment for both client and server; you just have to watch which node each pod lands on and delete it if necessary, then oc rsh into the pods to run the commands.
found this on a blog post somewhere ;-)
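To skip the "watch which node it lands on" step, the client and server can be pinned to specific nodes; a rough sketch using the image built above (the node names worker-a/worker-b and the :latest tag are placeholders):
NS=$(oc project -q)
oc create deployment perf-server --image=image-registry.openshift-image-registry.svc:5000/${NS}/perf:latest
oc create deployment perf-client --image=image-registry.openshift-image-registry.svc:5000/${NS}/perf:latest
# Pin each deployment to a specific node so the pods end up on different ESXi hosts
oc patch deployment perf-server -p '{"spec":{"template":{"spec":{"nodeName":"worker-a"}}}}'
oc patch deployment perf-client -p '{"spec":{"template":{"spec":{"nodeName":"worker-b"}}}}'
# Then: oc rsh deployment/perf-server -> iperf3 -s
#       oc rsh deployment/perf-client -> iperf3 -c <server pod IP>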
Hello! In my case, just by executing systemctl restart NetworkManager, tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation got turned off and the issue was resolved.
The disable is done via a NetworkManager dispatcher script, so that kind of makes sense. I wonder why it doesn't work the first time.
When I checked initially via systemctl status NetworkManager-dispatcher.service, I found two types of errors.
I also see the failed access in my environment. In my opinion, it is due to SELinux.
May 25 09:25:11 localhost.localdomain NetworkManager[1088]: <info> [1685006711.9105] manager: (patch-br-ex_worker1-cl1-dc3.s-ocp.cloud.mycompany.com-to-br-int): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/23)
May 25 09:25:11 localhost.localdomain audit[1099]: AVC avc: denied { read } for pid=1099 comm="nm-dispatcher" name="dispatcher.d" dev="sda4" ino=109431351 scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 tclass=dir permissive=0
May 25 09:25:11 localhost.localdomain audit[1099]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=56017b2615f0 a2=90800 a3=0 items=0 ppid=1 pid=1099 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="nm-dispatcher" exe="/usr/libexec/nm-dispatcher" subj=system_u:system_r:NetworkManager_dispatcher_t:s0 key=(null)
May 25 09:25:11 localhost.localdomain audit: PROCTITLE proctitle="/usr/libexec/nm-dispatcher"
May 25 09:25:11 localhost.localdomain nm-dispatcher[1099]: req:3 'hostname': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
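If SELinux is the suspect, the denials and the current label on the dispatcher directory can be checked directly on the node (a quick sketch; assumes audit logging is active):
# List recent AVC denials involving the NetworkManager dispatcher
ausearch -m avc -ts recent | grep nm-dispatcher
# Show the SELinux label currently applied to the dispatcher directory
ls -dZ /etc/NetworkManager/dispatcher.d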
Hello, did you try sudo systemctl restart NetworkManager?
Or:
restorecon -vR /etc/NetworkManager/dispatcher.d/; semodule -B; systemctl restart NetworkManager; systemctl restart kubelet
I had tried systemctl restart NetworkManager on one node after your message without thinking further. This breaks the SSH connection and kills the command, probably because of the missing parent process. I had to reset the node manually. I have not found anything to open any kind of tmux or screen session in Fedora CoreOS.
I can confirm that the offload parameters are also set in my environment.
[root@worker1-cl1-dc3 ~]# ethtool -k ens192 | grep tx-udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: off [fixed]
I ran network performance tests using iperf before and after changing the offload parameters. I used ethtool to change the offload parameters:
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
The difference between the tests is tiny. The gap in network speed between two pods on different nodes and between two VMs is very large (between two VMs the speed is around 7x higher), but as far as I know this is due to OVN. I did not notice any network disconnections.
Describe the bug
We run OKD in a vSphere environment with the below configuration:
After upgrading the cluster from a 4.10.x version to anything 4.11.x or above, pod-to-pod communication is severely degraded when the nodes that the pods run on are hosted on different ESXi hosts. We ran a benchmark test on the cluster before the upgrade with the below results:
After upgrading to version 4.11.0-0.okd-2023-01-14-152430, the latency between the pods is so high that the benchmark test, qperf test, and iperf test all time out and fail to run. This is the result of curling the network-check pod across nodes; it takes close to 30 seconds.
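For anyone wanting to reproduce the timing, curl can report the total request time directly (a sketch; the target pod IP and port are placeholders for whatever the openshift-network-diagnostics check uses in your cluster):
# Print the total time of a single request to the network-check target pod
curl -o /dev/null -s -w 'total: %{time_total}s\n' http://<target-pod-ip>:<port>/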
We have been able to reproduce this issue consistently on multiple different clusters.
Version 4.11.0-0.okd-2023-01-14-152430 IPI on vSphere
How reproducible: Upgrade to or install a 4.11.x or higher version of OKD and observe the latency.