Yup, can reproduce. The cluster install passes, but the console won't open. This is probably related to the iptables change in Fedora 33.
It appears this is happening in UPI bare metal as well - the following bugs are probably duplicates:
This isn't happening with OVNKubernetes, right?
Yes, only with OpenShiftSDN and Cilium.
I believe this is also affecting AWS.
Seems it was caused by https://fedoraproject.org/wiki/Changes/iptables-nft-default
Could anyone check if it works when the iptables backend is switched to the legacy implementation via the alternatives tool?
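For anyone trying that, a minimal sketch (assumes a Fedora CoreOS node where both backends are installed; paths and alternative names can differ between releases):

# Show which backend the iptables alternative currently points to
sudo alternatives --display iptables
# Switch iptables and ip6tables to the legacy backend, then reboot the node
sudo alternatives --set iptables /usr/sbin/iptables-legacy
sudo alternatives --set ip6tables /usr/sbin/ip6tables-legacy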
This appears to be fixed in the latest 4.6 nightly:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.okd-2021-01-28-091721 True False 41s Cluster version is 4.6.0-0.okd-2021-01-28-091721
$ oc get network -o yaml | grep networkType
networkType: OpenShiftSDN
Could anyone confirm?
@vrutkovs How can I verify that? Currently I use ovn-kubernetes in my cluster. Should I change the SDN to OpenShiftSDN in the running cluster, and if that succeeds, is that OK?
A clean install or an upgrade from 4.5 to 4.6 would ensure it's fixed. I didn't try changing the SDN in the cluster - and it's apparently not part of this issue.
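For reference, an explicit upgrade to a specific nightly payload looks roughly like this (a sketch; the pull spec is a placeholder for whichever nightly release image you want to test):

oc adm upgrade --to-image=<release-payload-pullspec> --allow-explicit-upgrade --force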
I think I am suffering from the same or a similar issue to https://github.com/openshift/okd/issues/395, which was marked as a duplicate of this issue.
In general, my upgrade from 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2021-01-23-132511 failed, so I started searching. Seeing the above, I restored my cluster from backup and tried upgrading again from 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2021-01-29-161622, which was a nightly build created after the comment saying it should be solved in nightlies.
Therefore it doesn't seem to be fixed for me, unless I am suffering from a different issue. To describe my case: I have a bare metal 4.5 test cluster (3 masters, 4 workers) working perfectly on the latest stable 4.5 (using Istio and Rook/Ceph, in case that matters). The upgrade succeeds up to the point where machine-config is updated; the masters succeed (while the workers get stuck) and then everything starts to misbehave:
However, what makes me think my case might be slightly different is that some of the API calls are failing with a 503 error, for example:
After being stuck for a while (my cluster broke on the 4.6 upgrade), I yesterday tried again to set it up anew using 4.6.0-0.okd-2021-01-23-132511 and specifying OVNKubernetes in my install-config.yaml. I made 2 runs, one using FCOS 32 and one using 33:
apiVersion: v1
baseDomain: grv.scbs.ch
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: okd
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret:
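For context, a config like the one above is consumed in the usual UPI flow; a sketch (openshift-install reads install-config.yaml from the given directory and removes it, so keep a copy):

cp install-config.yaml install-dir/
openshift-install create manifests --dir=install-dir
openshift-install create ignition-configs --dir=install-dir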
I then had the exact same outcome as described in #481 - the cluster is up, but openshift-apiserver is not available; subsequently (I guess) the console does not even get started.
[core@okd-bootstrap ~]$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication False True True 16h
cloud-credential 4.6.0-0.okd-2021-01-23-132511 True False False 16h
cluster-autoscaler 4.6.0-0.okd-2021-01-23-132511 True False False 16h
config-operator 4.6.0-0.okd-2021-01-23-132511 True False False 16h
console
csi-snapshot-controller 4.6.0-0.okd-2021-01-23-132511 True False False 16h
dns 4.6.0-0.okd-2021-01-23-132511 True False False 16h
etcd 4.6.0-0.okd-2021-01-23-132511 True False False 16h
image-registry 4.6.0-0.okd-2021-01-23-132511 True False True 16h
ingress 4.6.0-0.okd-2021-01-23-132511 True False False 16h
insights 4.6.0-0.okd-2021-01-23-132511 True False False 16h
kube-apiserver 4.6.0-0.okd-2021-01-23-132511 True False False 16h
kube-controller-manager 4.6.0-0.okd-2021-01-23-132511 True False False 16h
kube-scheduler 4.6.0-0.okd-2021-01-23-132511 True False False 16h
kube-storage-version-migrator 4.6.0-0.okd-2021-01-23-132511 True False False 16h
machine-api 4.6.0-0.okd-2021-01-23-132511 True False False 16h
machine-approver 4.6.0-0.okd-2021-01-23-132511 True False False 16h
machine-config 4.6.0-0.okd-2021-01-23-132511 True False False 16h
marketplace 4.6.0-0.okd-2021-01-23-132511 True False False 16h
monitoring False False True 16h
network 4.6.0-0.okd-2021-01-23-132511 True False False 16h
node-tuning 4.6.0-0.okd-2021-01-23-132511 True False False 16h
openshift-apiserver 4.6.0-0.okd-2021-01-23-132511 False False False 16h
openshift-controller-manager 4.6.0-0.okd-2021-01-23-132511 True False False 16h
openshift-samples
operator-lifecycle-manager 4.6.0-0.okd-2021-01-23-132511 True False False 16h
operator-lifecycle-manager-catalog 4.6.0-0.okd-2021-01-23-132511 True False False 16h
operator-lifecycle-manager-packageserver 4.6.0-0.okd-2021-01-23-132511 False True False 6s
service-ca 4.6.0-0.okd-2021-01-23-132511 True False False 16h
storage 4.6.0-0.okd-2021-01-23-132511 True False False 16h
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 16h Unable to apply 4.6.0-0.okd-2021-01-23-132511: an unknown error has occurred: MultipleErrors
I'm on UPI bare metal (VMware vCloud) and it all worked before with OKD 4.5 and FCOS 32 ;-/ None of the "possible dupes" #395 #435 #414 helped to resolve this.
@vrutkovs it seems to me that simply switching to OVNKubernetes doesn't solve it? Maybe it needs a different configuration in install-config.yaml?
gather bootstrap log bundle: log-bundle-20210204080459.tar.gz
I yesterday tried again to set it up anew using 4.6.0-0.okd-2021-01-23-132511 and specifying OVNKubernetes in my install-config.yaml
Please file a new bug for that, that's off-topic for OpenShiftSDN support.
@vrutkovs any progress on this? Any ETA? Plan? We are stuck with multiple bare metal clusters on OKD 4.5.
@vrutkovs the same here, still stuck on OKD 4.5 because of this issue.
I can't reproduce this issue, the cluster comes up correctly. Please collect a must-gather if you're still hitting the issue (it might be incomplete)
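For reference, collecting one is a single command run against the broken cluster (a sketch; --dest-dir is optional and defaults to a directory created under the current one):

oc adm must-gather --dest-dir=./must-gather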
Here is my must-gather https://easyupload.io/igw5sx.
@trozet any clues on how to debug that? The OKD 4.5 -> 4.6 upgrade breaks one of the openshift-apiserver pods, possibly some iptables problem on the F32 -> F33 upgrade?
@PiotrKlimczak I think this is GCP-only, unless you have the load balancer implemented in a similar way with a floating IP?
@errordeveloper I have a UPI configuration (bare metal), not on GCP...
@mafna never mind, I can see @vrutkovs said this affects UPI too. @PiotrKlimczak I'm taking my comment back.
Do we have a solution for this issue yet?
Environment: vSphere
OKD version: 4.6.0-0.okd-2021-02-14-205305
I now have similar problems after upgrading the VM hardware compatibility version to version 18 (previously it was 13).
This also happens with vm hardware compatibility version 15. EDIT: After a hacky downgrade to 13, the cluster is stable again.
Hello, I am reproducing the same problem with a fresh install of OKD 4.6 and Fedora CoreOS 33.20210201.3.0-live.x86_64 in my vSphere environment:
The virtual machines were on version 14. I edited the virtual machine version to 13.
Here is line 3 in the *.vmx file in each VM's directory, changed directly over an SSH console: virtualHW.version = "14"
Now it's OK... the console is up and running... I don't understand why...
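A minimal sketch of that downgrade, assuming SSH access to the ESXi host, the VM powered off, and placeholder datastore/VM names:

# On the ESXi host: lower the virtual hardware version in the VM's .vmx file
sed -i 's/virtualHW.version = "14"/virtualHW.version = "13"/' /vmfs/volumes/<datastore>/<vm>/<vm>.vmx
# Reload the VM definition so vSphere picks up the change
vim-cmd vmsvc/getallvms          # find the numeric VM id
vim-cmd vmsvc/reload <vmid>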
Interesting, I am on VMware with hw compat 14 too, will try to downgrade and update the cluster over the weekend to see if it fixes the problem for me.
Hello ;)
I destroyed my whole cluster (after downgrading virtualHW.version to 13).
I did a fresh install with new VMs on virtualHW.version=14, with exactly the same versions of all components.
I only changed the networkType to OVNKubernetes in my install-config file, like this:
apiVersion: v1
baseDomain: okd.local
metadata:
  name: lab
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
networking:
  clusterNetwork:
  - cidr: X.X.X.X/X
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - X.X.X.X/16
platform:
  none: {}
fips: false
And now everything is up and running.
In conclusion, there is a link between the virtual machine hardware version and the networkType: the problem appears when you combine a virtual hardware version > 13 with networkType: OpenShiftSDN.
Can't explain why ^^' seems weird ;)
Same issue here when upgrading from OKD 4.5 to 4.6.0-0.okd-2021-01-23-132511 (vSphere). I had some VMs with hardware version 14 and others with 15. I reverted all of them to HW version 13 and the upgrade to 4.6 finished successfully.
Can confirm the upgrade was successful after downgrading to HW 13. Any idea why newer versions are breaking?
Any idea why newer versions are breaking?
OpenShiftSDN bug - https://bugzilla.redhat.com/show_bug.cgi?id=1935591 and https://bugzilla.redhat.com/show_bug.cgi?id=1935539
Why are new versions breaking? Virtual hardware version 14 and greater enables Geneve and VXLAN tunnel offload. Take a look at:
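To see whether that offload is actually exposed inside the guest, something like this can be run on a node (a sketch; ens192 is the usual vmxnet3 interface name and may differ):

ethtool -k ens192 | grep tnl
# tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation showing "on"
# means the virtual hardware is offering Geneve/VXLAN segmentation offload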
I left this comment in the BZ but figured I might get more traction out here.
We have multiple vSphere clusters and different versions in each. The physical clusters are unable to reproduce the problem. I have an additional nested cluster where I can reproduce it.
It would be great if we could get some additional information. If those impacted could tell us:
1.) ESXi version with build numbers
2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged transmits
4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
5.) Virtual hardware version (we know it must be past 14)
If there is something interesting or unique about your vSphere cluster, please note it. Oh, and I don't care if it's your homelab; please let us know the details ;-)
Information about my test cluster where I saw this problem. At the moment, the cluster is running with OVNKubernetes.
1) VMware ESXi, 7.0.1, 17551050
2) Type of switch: Distributed
3) Switch security/policy:
If the network security policy has not been set correctly, I would be very grateful for a hint.
@Reamer thanks for the information. This afternoon I was able to reproduce the issue and the VMware folks reviewed the logs.
Snippet from the BZ, which I don't think is public:
"Vmxnet3 supports 8472 (when NSX is not installed) and 4789 (when NSX is installed). Any other destination port number won't work if guest overlay offload is to be used. Vmxnet3 plans to support different port numbers as configured by the user, but as of now it does not. So, the workarounds are either change the destination port or disable the tunnel segmentation offload from the guest."
We had already moved forward to disable the tunnel segmentation offload: https://github.com/openshift/machine-config-operator/pull/2482/files
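The manual equivalent of that change, for trying it on a running node before the MCO fix lands, is roughly (a sketch; the interface name is an assumption and the setting does not survive a reboot):

sudo ethtool -K ens192 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off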
VMware is also producing a KB article. I will post once I have it.
@vrutkovs Is it possible to cherry-pick these PRs to OKD 4.6 and 4.7?
Both (merged in a single PR for release-4.7, https://github.com/openshift/machine-config-operator/pull/2495) should be included in https://amd64.origin.releases.ci.openshift.org/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-04-11-205428 - could anyone give it a try?
Just updated my cluster to the latest 4.7 (a 7-day-old release), which should have the above fix. Unfortunately, newly added compute machines with HW level 17 (ESXi 7) still fail. The compute node joins the cluster correctly, however inter-host communications are failing. Pod health checks are also failing.
@PiotrKlimczak was your cluster installed with platform: vsphere or platform: none?
Platform: none. Is that the reason? If so, can I switch platforms somehow?
Yep - https://github.com/openshift/machine-config-operator/pull/2559
Switching platforms isn't supported at the moment.
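To check which platform an existing cluster was installed with (a sketch):

oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}'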
With platform: vsphere I can confirm that this problem is solved. I use vSphere with hardware version 15.
I got the same error with versions below:
OKD: 4.7.0-0.okd-2021-03-07-090821
vSphere: 7.0 U1 HW version 18
platform: none
With the updated configuration below, I resolved this issue.
OKD: 4.7.0-0.okd-2021-04-24-103438
vSphere: 7.0 U1 HW version 18
platform: vsphere
I got the same error while using OpenShiftSDN with a bare metal UPI, OpenShift version 4.13 installation ... platform: none: {} ...
Describe the bug
Client connections to the GCP LB time out when routed to nodes that don't run a router pod. All backend health checks are shown as failing in the GCP console, hence GCP routes clients to all nodes, which fails due to externalTrafficPolicy: Local in openshift-ingress:service/router-default.
Version
IPI / openshift-install 4.6.0-0.okd-2020-11-27-200126
How reproducible
Create a cluster in GCP with networkType: OpenShiftSDN.
Test LB ingress:
The output should be similar to this:
503 is expected since the test is against the LB IP without a hostname or URL, so it doesn't match any routes, but the connection time-outs are the problem.
Known Work-arounds
NB: Both work-arounds imply scaling the CVO and the ingress operator to zero, i.e.:
1) Scale openshift-ingress:deployments/router-default to run on all worker nodes, and remove CP nodes from the LB in the GCP console.
2) Set externalTrafficPolicy: Cluster in openshift-ingress:service/router-default, which makes health checks pass for CP nodes and eliminates connection time-outs.
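A sketch of work-around 2, scaling down the CVO and ingress operator first so they don't revert the change (deployment names are the standard ones):

oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
oc -n openshift-ingress-operator scale deployment ingress-operator --replicas=0
oc -n openshift-ingress patch service router-default --type merge -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'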