okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Router connectivity issue with OpenShiftSDN in 4.6 #430

Closed — errordeveloper closed this 3 years ago

errordeveloper commented 3 years ago

Describe the bug

Client connections to the GCP LB time out when routed to nodes that don't run the router pod. All backend health checks are shown as failing in the GCP console, so GCP routes clients to all nodes, which fails due to externalTrafficPolicy: Local in openshift-ingress:service/router-default.
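
The policy can be confirmed directly on the service with standard kubectl (same kubeconfig as in the reproduction steps below):

kubectl get service -n openshift-ingress router-default -o 'jsonpath={.spec.externalTrafficPolicy}'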

Version

IPI / openshift-install 4.6.0-0.okd-2020-11-27-200126

How reproducible

Create a cluster in GCP with networkType: OpenShiftSDN:

openshift-install create install-config --dir "${CLUSTER_NAME}"
sed -i 's/networkType:\ OVNKubernetes/networkType:\ OpenShiftSDN/' "${CLUSTER_NAME}/install-config.yaml"
openshift-install create cluster --dir "${CLUSTER_NAME}"

Test LB ingress:

LB_IP="$(kubectl get --kubeconfig "${CLUSTER_NAME}/auth/kubeconfig" service -n openshift-ingress router-default -o 'jsonpath={.status.loadBalancer.ingress[].ip}')"
while true ; do curl --silent --show-error --fail --output /dev/null --connect-timeout 7 "http://${LB_IP}" ; done

The output should be similar to this:

curl: (28) Connection timed out after 7004 milliseconds
curl: (28) Connection timed out after 7004 milliseconds
curl: (22) The requested URL returned error: 503 Service Unavailable
curl: (28) Connection timed out after 7004 milliseconds
curl: (22) The requested URL returned error: 503 Service Unavailable
curl: (22) The requested URL returned error: 503 Service Unavailable
curl: (28) Connection timed out after 7004 milliseconds
curl: (22) The requested URL returned error: 503 Service Unavailable
curl: (28) Connection timed out after 7001 milliseconds
curl: (28) Connection timed out after 7003 milliseconds
curl: (22) The requested URL returned error: 503 Service Unavailable
curl: (28) Connection timed out after 7000 milliseconds

The 503s are expected, since the test hits the LB IP without a hostname that matches any route; the connection timeouts are the problem.
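
As a positive control, the same request with a Host header that matches an existing route should get that route's normal response rather than a 503 (a sketch — the hostname here is hypothetical and depends on the cluster's apps domain):

curl --silent --show-error --fail --output /dev/null --connect-timeout 7 --header "Host: console-openshift-console.apps.okd.example.com" "http://${LB_IP}"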

Known Work-arounds

NB: Both work-arounds require scaling the CVO and the ingress operator to zero first, i.e.

kubectl scale deployments --namespace openshift-cluster-version cluster-version-operator --replicas=0
kubectl scale deployments --namespace openshift-ingress-operator ingress-operator --replicas=0

1) Scale openshift-ingress:deployments/router-default to run on all worker nodes, and remove the control-plane nodes from the LB in the GCP console.

2) Set externalTrafficPolicy: Cluster in openshift-ingress:service/router-default; this makes health checks pass for control-plane nodes and eliminates the connection timeouts.
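
A one-line sketch of work-around 2, as a strategic-merge patch on the service:

kubectl patch service --namespace openshift-ingress router-default -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'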

vrutkovs commented 3 years ago

Yup, can reproduce. The cluster install passes, but the console won't open. This is probably related to the iptables change in Fedora 33.

vrutkovs commented 3 years ago

It appears this is happening on UPI bare metal as well - the following bugs are probably duplicates:

LorbusChris commented 3 years ago

This isn't happening with OVNKubernetes, right?

errordeveloper commented 3 years ago

This isn't happening with OVNKubernetes, right?

Yes, only with OpenShiftSDN and Cilium.

ryandawsonuk commented 3 years ago

I believe this also affects AWS.

vrutkovs commented 3 years ago

Seems it was caused by https://fedoraproject.org/wiki/Changes/iptables-nft-default

Could anyone check if it works when the iptables backend is switched to the legacy implementation via the alternatives tool?
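
For reference, the switch would look roughly like this (assuming the legacy iptables binaries are present in the image; run on each node, then reboot):

sudo alternatives --set iptables /usr/sbin/iptables-legacy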

vrutkovs commented 3 years ago

This appears to be fixed in latest 4.6 nightly:

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.okd-2021-01-28-091721   True        False         41s     Cluster version is 4.6.0-0.okd-2021-01-28-091721
$ oc get network -o yaml | grep networkType
networkType: OpenShiftSDN

Could anyone confirm?

jomeier commented 3 years ago

@vrutkovs How can I verify that? Currently I use ovn-kubernetes in my cluster. Should I change the SDN to OpenShiftSDN in the running cluster, and if that succeeds, consider it OK?

vrutkovs commented 3 years ago

A clean install or an upgrade from 4.5 to 4.6 would verify it's fixed. I didn't try changing the SDN in a running cluster - that's apparently not part of the issue.

PiotrKlimczak commented 3 years ago

I think I am suffering from the same or a similar issue as https://github.com/openshift/okd/issues/395, which was marked as a duplicate of this issue.

In general, my upgrade from 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2021-01-23-132511 failed, so I started searching. Seeing the above, I restored my cluster from backup and tried upgrading again from 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2021-01-29-161622, a nightly build created after the comment that this should be solved in nightlies.

So for me it doesn't seem to be fixed, unless I am hitting a different issue. To describe my case: I have a bare metal 4.5 test cluster (3 masters, 4 workers), working perfectly on the latest stable 4.5 (using Istio and Rook/Ceph, in case that matters). The upgrade succeeds up to the point where machine-config is updated; the masters succeed (while the workers get stuck) and then everything starts to misbehave:

  1. Routes are down, even though pods are up
  2. oc commands against the cluster (from a remote machine) take 1s+ to complete (when they don't fail outright), work only with the kubeconfig file created during installation, and authentication is not possible, probably due to timeouts
  3. The etcd cluster is all happy and queries respond immediately
  4. A call to the API using curl with a key/cert combination returns the node list immediately, but the same query takes a long time via the oc command

However, what makes me think my case might be slightly different is that some of the API calls are failing with a 503 error, for example:

  1. https://X.cloud:6443/apis/route.openshift.io/v1?timeout=32s
  2. https://X.cloud:6443/apis/packages.operators.coreos.com/v1?timeout=32s
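
Those endpoints can be probed directly through the apiserver's raw passthrough, e.g.:

oc get --raw '/apis/route.openshift.io/v1?timeout=32s'
oc get --raw '/apis/packages.operators.coreos.com/v1?timeout=32s'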

narcoticfresh commented 3 years ago

After being stuck for a while (my cluster broke on the 4.6 upgrade), yesterday I tried again to set it up from scratch using 4.6.0-0.okd-2021-01-23-132511 and specifying OVNKubernetes in my install-config.yaml - I made 2 runs, one using FCOS 32, one using 33.

apiVersion: v1
baseDomain: grv.scbs.ch
compute:
- hyperthreading: Enabled   
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled   
  name: master 
  replicas: 3
metadata:
  name: okd
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OVNKubernetes
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {} 
fips: false 
pullSecret:

I then had the exact same outcome as described in #481 - the cluster is up, but openshift-apiserver is not available - and subsequently (I guess) the console does not even get started.

[core@okd-bootstrap ~]$ oc get co
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                             False       True          True       16h
cloud-credential                           4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
cluster-autoscaler                         4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
config-operator                            4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
console                                                                                                         
csi-snapshot-controller                    4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
dns                                        4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
etcd                                       4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
image-registry                             4.6.0-0.okd-2021-01-23-132511   True        False         True       16h
ingress                                    4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
insights                                   4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
kube-apiserver                             4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
kube-controller-manager                    4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
kube-scheduler                             4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
kube-storage-version-migrator              4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
machine-api                                4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
machine-approver                           4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
machine-config                             4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
marketplace                                4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
monitoring                                                                 False       False         True       16h
network                                    4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
node-tuning                                4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
openshift-apiserver                        4.6.0-0.okd-2021-01-23-132511   False       False         False      16h
openshift-controller-manager               4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
openshift-samples                                                                                               
operator-lifecycle-manager                 4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
operator-lifecycle-manager-catalog         4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
operator-lifecycle-manager-packageserver   4.6.0-0.okd-2021-01-23-132511   False       True          False      6s
service-ca                                 4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
storage                                    4.6.0-0.okd-2021-01-23-132511   True        False         False      16h
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          16h     Unable to apply 4.6.0-0.okd-2021-01-23-132511: an unknown error has occurred: MultipleErrors

I'm on UPI bare metal (VMware vCloud) and it all worked before with OKD 4.5 and FCOS 32 ;-/ None of the "possible dupes" #395 #435 #414 helped to resolve this.

@vrutkovs it seems to me that simply switching to OVNKubernetes doesn't solve it..? Maybe it needs a different configuration in install-config.yaml?

gather bootstrap log bundle: log-bundle-20210204080459.tar.gz

vrutkovs commented 3 years ago

I yesterday tried again to set it up anew using 4.6.0-0.okd-2021-01-23-132511 and specifying OVNKubernetes in my install-config.yaml

Please file a new bug for that; it's off-topic for OpenShiftSDN support.

PiotrKlimczak commented 3 years ago

@vrutkovs any progress on this? Any ETA? Plan? We are stuck with multiple bare metal clusters on OKD 4.5.

mafna commented 3 years ago

@vrutkovs same here, still stuck on OKD 4.5 because of this issue.

vrutkovs commented 3 years ago

I can't reproduce this issue - the cluster comes up correctly. Please collect a must-gather if you're still hitting the issue (it might be incomplete).
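
For reference, the standard collection command (the --dest-dir flag is optional):

oc adm must-gather --dest-dir=./must-gather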

mafna commented 3 years ago

Here is my must-gather https://easyupload.io/igw5sx.

vrutkovs commented 3 years ago

@trozet any clues how to debug this? The OKD 4.5 -> 4.6 upgrade breaks one of the openshift-apiserver pods - possibly some iptables problem on the F32 -> F33 upgrade?

errordeveloper commented 3 years ago

@PiotrKlimczak I think this is GCP-only, unless you have the load balancer implemented in a similar way with a floating IP?

mafna commented 3 years ago

@errordeveloper I have a UPI configuration (bare metal), not on GCP...

errordeveloper commented 3 years ago

@mafna never mind, I can see @vrutkovs said this affects UPI too. @PiotrKlimczak I'm taking my comment back.

mafna commented 3 years ago

Do we have a solution for this issue yet?

Reamer commented 3 years ago

Environment: vSphere. OKD version: 4.6.0-0.okd-2021-02-14-205305. I now have similar problems after upgrading the VM hardware compatibility version to version 18 (previously it was 13).

Reamer commented 3 years ago

This also happens with VM hardware compatibility version 15. EDIT: After a hacky downgrade to 13, the cluster is stable again.

ShinjiX commented 3 years ago

Hello, I am reproducing the same problem with a fresh install of OKD 4.6 and FedoraCoreOS 33.20210201.3.0-live.x86_64 in my vSphere environment:

The virtual machines were on version 14. I edited the virtual machine version to 13.

Here is the line (line 3) in the *.vmx file in each VM's directory, viewed directly in an SSH console: virtualHW.version = "14"

Now it's OK ... the console is up and running ... I don't understand why ...
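
A sketch of that edit over SSH on the ESXi host (hypothetical datastore/VM path; the VM must be powered off first, and ESXi must reload the .vmx afterwards):

sed -i 's/virtualHW.version = "14"/virtualHW.version = "13"/' /vmfs/volumes/datastore1/okd-node/okd-node.vmx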

PiotrKlimczak commented 3 years ago

Interesting - I am on VMware with HW compatibility 14 too. I will try to downgrade and update the cluster over the weekend to see if it fixes the problem for me.

ShinjiX commented 3 years ago

Hello ;)

I destroyed my whole cluster (after downgrading virtualHW.version to 13).

I did a fresh install with new VMs on "virtualHW.version=14", with exactly the same versions of all components.

I just changed the "networkType" to "OVNKubernetes" in my install-config file, like this:

apiVersion: v1
baseDomain: okd.local
metadata:
  name: lab

compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0

controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3

networking:
  clusterNetwork:
  - cidr: X.X.X.X/X
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - X.X.X.X/16

platform:
  none: {}

fips: false

And now everything is up and running.

In conclusion, there is a link between the machine virtual hardware version and the networkType: things break when you combine a machine virtual hardware version > 13 with networkType: OpenShiftSDN.

I can't explain why ^^' it seems weird ;)

martypab commented 3 years ago

Same issue here when upgrading from OKD 4.5 to 4.6.0-0.okd-2021-01-23-132511 (vSphere). I had some VMs with hardware version 14 and others with 15. I reverted all of them to HW version 13 and the upgrade to 4.6 finished successfully.

PiotrKlimczak commented 3 years ago

I can confirm the upgrade succeeds after downgrading to HW version 13. Any idea why newer versions are breaking?

vrutkovs commented 3 years ago

Any idea why newer versions are breaking?

OpenshiftSDN bug - https://bugzilla.redhat.com/show_bug.cgi?id=1935591 and https://bugzilla.redhat.com/show_bug.cgi?id=1935539

jcpowermac commented 3 years ago

Why are new versions breaking? Virtual hardware version 14 and greater enables Geneve and VXLAN tunnel offload. Take a look at:

https://github.com/torvalds/linux/commit/dacce2be33124df3c71f979ac47e3d6354a41125#diff-db4c3dfb5fede7bacdecc2e2c486cb29369c21885ffa6ccb6cd4220c37b0fa75

I left this comment in the BZ but figured I might get more traction out here.

We have multiple vSphere clusters and different versions in each. The physical clusters are unable to reproduce the problem. I have an additional nested cluster where I can reproduce.

It would be great if we could get some additional information. If those impacted could tell us:

1.) ESXi version w/ build numbers
2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged transmits
4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
5.) Virtual hardware version (we know it must be past 14)

If there is something interesting or unique about your vSphere cluster, please note it. Oh, and I don't care if it's your homelab - please let us know the details ;-)
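
A quick way to collect some of the above (a sketch; the first command runs on the ESXi host over SSH, the second on any cluster node):

vmware -vl                  # ESXi version and build number
lscpu | grep 'Model name'   # CPU model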

Reamer commented 3 years ago

Information about my test cluster where I saw this problem. At the moment, the cluster is running with OVNKubernetes.

1) VMware ESXi, 7.0.1, 17551050
2) Type of switch: Distributed
3) Switch security/policy:

If the network security policy has not been set correctly, I would be very grateful for a hint.

jcpowermac commented 3 years ago

@Reamer thanks for the information. This afternoon I was able to reproduce the issue and the VMware folks reviewed the logs.

A snippet from the BZ, which I don't think is public:

"Vmxnet3 supports 8472 (when NSX is not installed) and 4789 (when NSX is installed). Any other destination port number won't work if guest overlay offload is to be used. Vmxnet3 plans to support different port numbers as configured by the user, but as of now it does not. So, the workarounds are either change the destination port or disable the tunnel segmentation offload from the guest."

We had already moved forward to disable the tunnel segmentation offload: https://github.com/openshift/machine-config-operator/pull/2482/files
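
On a live node, the equivalent manual mitigation would look roughly like this (a sketch; the interface name is an assumption, and the MCO change applies it persistently rather than by hand):

ethtool -K ens192 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off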

VMware is also producing a KB article. I will post once I have it.

jomeier commented 3 years ago

@vrutkovs Is it possible to cherry-pick these PRs to OKD 4.6 and 4.7?

vrutkovs commented 3 years ago

Both fixes (merged in a single PR, https://github.com/openshift/machine-config-operator/pull/2495, for release-4.7) should be included in https://amd64.origin.releases.ci.openshift.org/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-04-11-205428 - could anyone give it a try?

PiotrKlimczak commented 3 years ago

Just updated my cluster to the latest 4.7 (a 7-day-old release), which should have the above fix. Unfortunately, newly added compute machines with HW level 17 (ESXi 7) still fail. The compute node joins the cluster correctly, but inter-host communication fails, and pod health checks fail too.

jcpowermac commented 3 years ago

@PiotrKlimczak was your cluster installed with platform: vsphere or platform: none?

PiotrKlimczak commented 3 years ago

Platform: none. Is that the reason? If so, can I switch platforms somehow?

jcpowermac commented 3 years ago

Platform: none. Is that the reason? If so can I switch platform somehow?

Yep - https://github.com/openshift/machine-config-operator/pull/2559

Switching platforms isn't supported at the moment.

Reamer commented 3 years ago

With platform: vsphere I can confirm that this problem is solved. I use vSphere with hardware version 15.

Frewx commented 3 years ago

I got the same error with versions below:

OKD: 4.7.0-0.okd-2021-03-07-090821
vSphere: 7.0 U1 HW version 18
platform: none

With the updated configuration below, I resolved this issue.

OKD: 4.7.0-0.okd-2021-04-24-103438
vSphere: 7.0 U1 HW version 18
platform: vsphere
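
For reference, the platform stanza that replaces platform: none in install-config.yaml looks roughly like this (a sketch with hypothetical values; check the vSphere install docs for the exact field names):

platform:
  vsphere:
    vcenter: vcenter.example.com
    username: administrator@vsphere.local
    password: <redacted>
    datacenter: dc1
    defaultDatastore: datastore1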

NgHuuAn commented 1 year ago

I got the same error while using OpenShiftSDN with bare metal UPI, on an OpenShift version 4.13 installation ... platform: none: {} ...