openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

[4.10] vSphere IPI: bootstrap node not acquiring VIP #5772

Closed elysweyr closed 2 years ago

elysweyr commented 2 years ago

Version

$ openshift-install version
./openshift-install 4.10.0-0.okd-2022-03-07-131213
built from commit 3b701903d96b6375f6c3852a02b4b70fea01d694
release image quay.io/openshift/okd@sha256:2eee0db9818e22deb4fa99737eb87d6e9afcf68b4e455f42bdc3424c0b0d0896
release architecture amd64

Platform:

vSphere IPI (custom network)

What happened?

The bootstrap node is not acquiring the API VIP, and therefore the control plane nodes are not able to reach https://<api-vip>:22623/config/master. The installation gets stuck at this point, before the control plane nodes have even been initially configured, and it does not recover no matter how long it is left running. As far as I understood it, the bootstrap node should hold the API VIP at the beginning; once the initial configuration of the control plane nodes is finished, they take over the API VIP and the bootstrap node is destroyed.
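
For illustration, a minimal way to verify this symptom from the nodes themselves (a sketch only; it assumes SSH access as the core user, and <bootstrap-ip>/<api-vip> are placeholders for the addresses used here):

# On the bootstrap node: the API VIP should be bound to an interface
# while bootstrapping is in progress
ssh core@<bootstrap-ip>
ip -4 addr show | grep -F '<api-vip>'

# Is keepalived (the component that manages the VIPs) running at all?
# Depending on how it is launched in this release it may show up in either runtime:
sudo crictl ps | grep -i keepalived
sudo podman ps | grep -i keepalived

# From any host that can reach the machine network: does the Machine Config
# Server answer on the VIP? (expect some HTTP/TLS response rather than a timeout)
curl -kI https://<api-vip>:22623/config/master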

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

$ ./openshift-install create cluster --dir ./okd-custom-net --log-level=debug

Log bundle

log-bundle-20220322001704.tar.gz Example configuration: machines receive DHCP leases in 10.0.220.0/24 (DHCP range .100 - .200). Same behavior when I changed the VIPs to 10.0.220.2 and 10.0.220.3.
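
(The bundle above is the kind produced by the installer's gather command; shown as a sketch, with the node IPs as placeholders:)

./openshift-install gather bootstrap \
  --dir ./okd-custom-net \
  --bootstrap <bootstrap-ip> \
  --master <master-ip>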

References

Zveroloff commented 2 years ago

Almost the same behavior with baremetal deployment

elysweyr commented 2 years ago

@Zveroloff Did you find any solution, yet?

elysweyr commented 2 years ago

Seems to be caused by https://github.com/openshift/okd/issues/1182

fortinj66 commented 2 years ago

@elysweyr This seems to be a slightly different issue than what is happening in https://github.com/openshift/okd/issues/1182. In 1182 the bootstrap node cannot resolve api-int. and never finishes/cleans up.

The rest of the cluster installs fine.

I could be misunderstanding your issue though...

elysweyr commented 2 years ago

This seems to be a slightly different issue than what is happening in https://github.com/openshift/okd/issues/1182.

@fortinj66 Thanks for your answer. You may well be correct, but reverting to 4.9.28 resulted in a working cluster deployment. Unfortunately, the deployment on 4.9.28 wasn't reproducible a second time with an identical, backed-up installer-config.yml. I may have messed something up severely, or some changes in 4.10.x broke new vSphere IPI deployments.

fortinj66 commented 2 years ago

4.9.28? That would be an OCP install, not an OKD install... OKD sits on FCOS while OCP sits on RHCOS, and they can have different characteristics...

elysweyr commented 2 years ago

@fortinj66 Thanks for the clarification. But as you can see the initial tries were done with 4.10.0-0.okd-2022-03-07-131213. I'll give it one more try with my learnings from the successful 4.9.28-ocp installation.

fortinj66 commented 2 years ago

I would try it with a 4.9 OKD release... More apples to apples...
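
(For what it's worth, the matching installer binary for a specific OKD release can be pulled out of the release image with oc; a sketch, where the release tag is a placeholder for an actual 4.9 OKD tag from quay.io/openshift/okd:)

# Extract the openshift-install binary that matches a given OKD release image
oc adm release extract \
  --command=openshift-install \
  --to=./bin \
  quay.io/openshift/okd:<4.9-release-tag>

# Confirm the extracted binary reports the expected version
./bin/openshift-install version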

fortinj66 commented 2 years ago

Why did you change the machineNetwork from the default?

Yours:

machineNetwork:
- cidr: 10.200.0.0/16

Default:

machineNetwork:
- cidr: 10.0.0.0/16

I have seen issues when this is the same as the DHCP network.

You may also want to try networkType: OVNKubernetes rather than OpenShiftSDN. This is now the default.
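
The suggestion above would look roughly like this in install-config.yaml (a sketch only; the machineNetwork value is the one from the reporter's config, and the clusterNetwork/serviceNetwork lines are the usual installer defaults, shown for completeness):

networking:
  networkType: OVNKubernetes   # suggested above; OpenShiftSDN was used originally
  machineNetwork:
  - cidr: 10.200.0.0/16
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16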

elysweyr commented 2 years ago

I would try it with a 4.9 OKD release... More apples to apples...

ACK

Why did you change the default machineNetwork from the default?

10.0.0.0/16 would cause a massive collision; that range is used for addressing the site this OKD cluster is located in, e.g. ingressVIP: 10.0.220.3.

You may also want to try networkType: OVNKubernetes rather than OpenShiftSDN

Changing it to OpenShiftSDN fixed some problems for me but I'll try it out again on 4.9.

fortinj66 commented 2 years ago

I would change cidr: 10.200.0.0/16 to something that doesn't conflict with your DHCP network

elysweyr commented 2 years ago

I would change cidr: 10.200.0.0/16 to something that doesn't conflict with your DHCP network

It's not conflicting with my DHCP scopes (screenshot of the DHCP scopes attached).

fortinj66 commented 2 years ago

I misread it... Sorry about that...

I looked at the logs. The bootstrap node never becomes ready, so the API VIP never becomes available:

Mar 21 23:17:02 localhost.localdomain podman[27862]: 2022-03-21 23:17:02.477881408 +0000 UTC m=+0.107540816 image pull  quay.io/openshift/okd-content@sha256:497ff9efd16f42d12eddd648dc5bddddcf478b4362c414c24b2801ba459d452e
Mar 21 23:17:02 localhost.localdomain bootkube.sh[27862]: Error: error creating container storage: the container name "etcdctl" is already in use by "e9a22cd4d9cd97addfbbb1a1f073012d5880743a832f9d66c8f0dd4ec3afc62d". You have to remove that container to be able to reuse that name.: that name is already in use
Mar 21 23:17:02 localhost.localdomain bootkube.sh[1650]: etcdctl failed. Retrying in 5 seconds...
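
Not a root-cause fix, but when bootkube.sh loops on that particular error, the stale container can usually be removed by hand so the next retry can proceed (a sketch; run on the bootstrap node):

# Show and remove the leftover "etcdctl" container named in the error
sudo podman ps -a --filter name=etcdctl
sudo podman rm -f etcdctl

# bootkube retries on its own every few seconds; watch its progress with
journalctl -b -f -u bootkube.service
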
fortinj66 commented 2 years ago

I've seen this happen once during my testing of OKD 4.10. A re-install seemed to correct the issue, although I do not know why it happened in the first place...

Zveroloff commented 2 years ago

@elysweyr did you succeed with OKD 4.10? I haven't. I tried to follow the https://github.com/openshift/okd/issues/1182 workaround (modified nsswitch.conf on the fly on the bootstrap node), but it didn't help. I also tried to fall back to OKD 4.9, but that surprisingly caused the provisioner host OS to get stuck. Edit: the host OS getting stuck was unrelated, but the physical nodes are not booting with the 4.9 images.

elysweyr commented 2 years ago

@Zveroloff So far no success with 4.10. I am currently trying to deploy a working 4.9 installation, but as I am doing this at home (slow internet connection) I am running into several installer timeouts and the process itself is taking forever.

Edit: The ability to cache the downloaded images locally (in a simple way) would be really great, as re-downloading the images for each new deployment takes most of the time.
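
In the meantime, one way to avoid re-downloading everything is to mirror the release payload into a local registry once and point the installer at it (a sketch; the registry host, repository, tag, and digest are placeholders, and the mirror command should print the imageContentSources snippet to paste into install-config.yaml):

# Mirror the OKD release image into a local registry
oc adm release mirror \
  --from=quay.io/openshift/okd@sha256:<release-digest> \
  --to=<local-registry>/okd \
  --to-release-image=<local-registry>/okd:<tag>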

elysweyr commented 2 years ago

Seems like the hostname assignment on master0 fails every time. I will investigate this further and create a log bundle afterwards.

-- Journal begins at Fri 2022-04-15 11:06:55 UTC. --
Apr 15 11:30:50 localhost systemd[1]: Starting Wait for a non-localhost hostname...
Apr 15 11:30:50 localhost mco-hostname[1142]: waiting for non-localhost hostname to be assigned
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: start operation timed out. Terminating.
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Main process exited, code=killed, status=15/TERM
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Failed with result 'timeout'.
Apr 15 11:35:50 localhost systemd[1]: Failed to start Wait for a non-localhost hostname.
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Consumed 1.316s CPU time.
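
A few things worth checking on master0 to narrow this down (a sketch; assumes SSH access as the core user, with <master0-ip> and <cluster-domain> as placeholders):

# Current (transient) hostname - it stays "localhost" when assignment fails
ssh core@<master0-ip>
hostnamectl

# Did NetworkManager ever receive a hostname via DHCP?
journalctl -b -u NetworkManager | grep -i hostname

# Full log of the failing unit
journalctl -b -u node-valid-hostname.service

# Temporary test only (not a fix - DHCP/DNS should provide the name):
# setting a hostname by hand should let the unit succeed once it is restarted
sudo hostnamectl set-hostname master0.<cluster-domain>
sudo systemctl restart node-valid-hostname.service
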
Zveroloff commented 2 years ago

Currently had success with OKD 4.10, 23-04-2022 build

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/5772#issuecomment-1280099357):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

elysweyr commented 4 months ago

I basically used a newer version of openshift-install together with a newer OKD (4.11), and this combination made it work. Sorry for forgetting to post any updates!