ryanhay / ocp4-metal-install

Install OpenShift 4 on Bare Metal - UPI

OpenShift 4.7: a known bug causes the bootstrap and install phases to fail on VMware virtual hardware version 14 VMs. #6

Open simonboydfoley opened 3 years ago

simonboydfoley commented 3 years ago

You may want to put a warning up on your docs :-) #2 months of pain.

https://bugzilla.redhat.com/show_bug.cgi?id=1935539

Red Hat have issued a warning on their website:

Virtual machines (VMs) configured to use virtual hardware version 14 or greater might result in a failed installation. It is recommended to configure VMs with virtual hardware version 13. This is a known issue that is being addressed in BZ#1935539.
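To check which side of that warning a VMware guest is on, the hardware version can be read from the VM's .vmx file, where it is stored under the `virtualHW.version` key. A minimal sketch, assuming shell access to the datastore; the helper name `vmx_hw_version` and the example path are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the virtual hardware version recorded in a .vmx file.
# Assumes the usual .vmx line format: virtualHW.version = "14"
vmx_hw_version() {
    sed -n 's/^virtualHW\.version[[:space:]]*=[[:space:]]*"\([0-9][0-9]*\)".*/\1/p' "$1"
}
```

For example, `vmx_hw_version /vmfs/volumes/datastore1/ocp-bootstrap/ocp-bootstrap.vmx` (path is an assumption) prints the version number; per the warning above, 14 or greater is the affected range.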

I have seen this with both ESXi 7.0b and Proxmox 6.3-6. A warning may save people the 2 months of pain I have been going through.

The nature of the problem means you just have to keep restarting the bootstrap / install phase every time it times out and fails. Eventually, if you are lucky, you will get through the problem within 24 hours.

simonboydfoley commented 3 years ago

Update: I have now spent 3 months trying to get OpenShift 4.7 to work, and I have installed it more than 30 times with packet captures running. I have used ESXi and Proxmox and also installed 4 different NICs with different levels of offload and SR-IOV support. In essence RH have completely screwed the pooch with OpenShift 4.7.

The problem seems to be related to broken VMXNET3 network virtualisation support. I have tried disabling tunnelling UDP checksum offload on all the NICs with little success, so I am highly suspicious that the workaround suggested in the BZ above (disabling UDP checksum offload) is valid.
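For reference, this is roughly what disabling the UDP tunnel offload features looks like with `ethtool`. It is a sketch of the kind of workaround discussed, not a confirmed fix (it did not help in my tests). The helper name is hypothetical and the interface name is an assumption; the function only prints the commands so they can be reviewed before being run as root on each node:

```shell
#!/bin/sh
# Hypothetical helper: print the ethtool commands that switch off the
# UDP tunnel offload features on one interface. Nothing is executed here.
udp_tnl_offload_cmds() {
    iface="$1"   # interface name, e.g. ens192 (an assumption; check with `ip link`)
    for feature in tx-udp_tnl-segmentation tx-udp_tnl-csum-segmentation; do
        printf 'ethtool -K %s %s off\n' "$iface" "$feature"
    done
}
```

`udp_tnl_offload_cmds ens192` prints one `ethtool -K` line per feature. Current settings can be inspected with `ethtool -k ens192`. Changes made this way do not persist across reboots.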

The only way to get 4.7 to install at all is to switch from OpenShiftSDN to OVN and abandon VMXNET3:

networkType: OVNKubernetes

Then it installed first time no problem on OVN.
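The switch is a one-line change to `networkType` in install-config.yaml, made before the installer consumes the file when generating manifests. A minimal sketch of doing it from the shell; the helper name `set_network_type` is hypothetical, and it assumes the file already contains a `networkType:` line:

```shell
#!/bin/sh
# Hypothetical helper: replace the networkType value in an install-config.yaml.
# Keeps the original indentation and leaves a .bak copy next to the file.
set_network_type() {
    file="$1"   # path to install-config.yaml
    type="$2"   # e.g. OVNKubernetes or OpenShiftSDN
    sed -i.bak "s/^\([[:space:]]*networkType:\).*/\1 ${type}/" "$file"
}
```

For example, `set_network_type ~/ocp-install/install-config.yaml OVNKubernetes` (the directory is an assumption), then confirm with `grep networkType ~/ocp-install/install-config.yaml`.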

I just cannot get 4.7 to work at all when using OpenShiftSDN (vmxnet3 network emulation in the cluster). To be clear, what OpenShift uses in its network emulation layer is different from what your VM software uses for its NIC emulation. The two interact and there may be some causality, but let's not mix up the two aspects of emulation (VM NIC and vswitch emulation) with how OpenShift dynamically implements its networks on the cluster nodes:

clusterNetwork:

Some people suggest using VMware hardware version 13 (ESXi 6.5), but I have not gone that far back. I have tried ESXi 7 and 6.7 so far with the workarounds, with no success. I have tried every NIC emulation type in Proxmox and the same problem still exists.

So my advice to you would be to update your lovely documentation with a warning that if you are using OpenShift 4.7 ...
networkType: OVNKubernetes

Geethanath10 commented 2 years ago

[root@ocp-svc ~]# ~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.9.18
DEBUG Built from commit eb132dae953888e736c382f1176c799c0e1aa49e
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.ocp.lan:6443...
DEBUG Still waiting for the Kubernetes API: the server has asked for the client to provide credentials

Does anyone have any idea about this error?