Almost the same behavior with a bare-metal deployment
@Zveroloff Did you find any solution, yet?
Seems to be caused by https://github.com/openshift/okd/issues/1182
@elysweyr This seems to be a slightly different issue than what is happening in https://github.com/openshift/okd/issues/1182. In 1182 the bootstrap node cannot resolve api-int.
The rest of the cluster installs fine.
I could be misunderstanding your issue though...
This seems to be a slightly different issue than what is happening in https://github.com/openshift/okd/issues/1182.
@fortinj66 Thanks for your answer. You may well be correct, but reverting to 4.9.28 resulted in a working cluster deployment. Unfortunately, the deployment on 4.9.28 wasn't reproducible a second time with an identical backed-up installer-config.yml. I may have something severely messed up, or some change in 4.10.X broke new vSphere IPI deployments.
4.9.28? That would be an OCP install not an OKD install... OKD sits on FCOS and OCP sits on RHCOS and can have different characteristics...
@fortinj66 Thanks for the clarification. But as you can see, the initial tries were done with 4.10.0-0.okd-2022-03-07-131213. I'll give it one more try with my learnings from the successful 4.9.28-ocp installation.
I would try it with a 4.9 OKD release... More apples to apples...
Why did you change the default machineNetwork from the default?
Yours:

machineNetwork:
  - cidr: 10.200.0.0/16

Default:

machineNetwork:
  - cidr: 10.0.0.0/16
I have seen issues when this is the same as the DHCP network.
You may also want to try networkType: OVNKubernetes rather than OpenShiftSDN; this is now the default.
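For illustration, a minimal sketch of what the networking stanza of an install-config.yaml could look like with both suggestions applied (OVNKubernetes plus an explicit machineNetwork). The clusterNetwork and serviceNetwork values shown are the usual defaults, and the machineNetwork CIDR is only a placeholder to be replaced with a range that fits your environment:

networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.200.0.0/16   # placeholder - pick a range that does not overlap other networks on site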
I would try it with a 4.9 OKD release... More apples to apples...
ACK
Why did you change the default machineNetwork from the default?
10.0.0.0/16 would cause a massive collision. 10.0/16 is used for addressing the site where this OKD cluster is located, e.g. ingressVIP: 10.0.220.3.
You may also want to try networkType: OVNKubernetes rather than OpenShiftSDN
Changing it to OpenShiftSDN fixed some problems for me, but I'll try it out again on 4.9.
I would change cidr: 10.200.0.0/16 to something that doesn't conflict with your DHCP network
I would change cidr: 10.200.0.0/16 to something that doesn't conflict with your DHCP network
It's not conflicting with my DHCP scopes.
I misread it... Sorry about that...
I looked at the logs. The bootstrap node never becomes ready, so the API VIP never comes up:
Mar 21 23:17:02 localhost.localdomain podman[27862]: 2022-03-21 23:17:02.477881408 +0000 UTC m=+0.107540816 image pull quay.io/openshift/okd-content@sha256:497ff9efd16f42d12eddd648dc5bddddcf478b4362c414c24b2801ba459d452e
Mar 21 23:17:02 localhost.localdomain bootkube.sh[27862]: Error: error creating container storage: the container name "etcdctl" is already in use by "e9a22cd4d9cd97addfbbb1a1f073012d5880743a832f9d66c8f0dd4ec3afc62d". You have to remove that container to be able to reuse that name.: that name is already in use
Mar 21 23:17:02 localhost.localdomain bootkube.sh[1650]: etcdctl failed. Retrying in 5 seconds...
I've seen this happen once during my testing of OKD 4.10. A re-install seemed to correct the issue, although I do not know why it happened in the first place...
@elysweyr did you succeed with OKD 4.10? I haven't. I tried to follow the workaround from https://github.com/openshift/okd/issues/1182 (modified nsswitch.conf on the fly on the bootstrap node), but it didn't help. I also tried to fall back to OKD 4.9, but it surprisingly caused the provisioner host OS to get stuck. Edit: the host OS getting stuck was unrelated, but the physical nodes are not booting with 4.9 images.
@Zveroloff So far no success with 4.10. I am currently trying to deploy a working 4.9 installation, but as I am doing this at home (slow internet connection), I am running into several installer timeouts and the process itself is taking forever.
Edit: The possibility to cache the downloaded images locally (in a simple way) would be really great, as re-downloading the images after a new deployment takes most of the time.
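As a rough sketch of one way such local caching could work: the release images can be mirrored to a registry on the local network and referenced from install-config.yaml via imageContentSources. The registry host registry.home.lab:5000 below is purely hypothetical, and the source repositories are inferred from the quay.io/openshift/okd-content image seen in the bootstrap log above:

imageContentSources:
- mirrors:
  - registry.home.lab:5000/okd-content   # hypothetical local mirror registry
  source: quay.io/openshift/okd-content
- mirrors:
  - registry.home.lab:5000/okd
  source: quay.io/openshift/okd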
Seems like the hostname assignment on master0 fails every time. I will investigate this further and create a log bundle afterwards.
-- Journal begins at Fri 2022-04-15 11:06:55 UTC. --
Apr 15 11:30:50 localhost systemd[1]: Starting Wait for a non-localhost hostname...
Apr 15 11:30:50 localhost mco-hostname[1142]: waiting for non-localhost hostname to be assigned
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: start operation timed out. Terminating.
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Main process exited, code=killed, status=15/TERM
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Failed with result 'timeout'.
Apr 15 11:35:50 localhost systemd[1]: Failed to start Wait for a non-localhost hostname.
Apr 15 11:35:50 localhost systemd[1]: node-valid-hostname.service: Consumed 1.316s CPU time.
I've now had success with OKD 4.10, the 23-04-2022 build.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
I basically used a newer version of the openshift-installer as well as a newer OKD (4.11) itself, and this combination made it work. Sorry for forgetting to post any updates!
Version
Platform:
vSphere IPI (custom network)
What happened?
The bootstrap node is not acquiring the API VIP and therefore the control plane nodes are not able to reach https://<api-vip>:22623/config/master. The installation process gets stuck here before the control plane nodes are even initially configured. This continues indefinitely and is not resolved by waiting for a certain amount of time. As far as I understood, the bootstrap node should hold the API VIP in the beginning; after the initial configuration of the control plane nodes is finished, they take over the API VIP and the bootstrap node is destroyed.
What you expected to happen?
How to reproduce it (as minimally and precisely as possible)?
Log bundle
log-bundle-20220322001704.tar.gz

Exemplary configuration - machines receive DHCP leases in 10.0.220.0/24 (DHCP range .100-.200). Same behavior when I changed the VIPs to 10.0.220.2 and 10.0.220.3.
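For context, a hedged reconstruction of what the vSphere platform stanza of the install-config.yaml might have looked like with those VIPs; the vCenter, datacenter, datastore, and network names are placeholders, and only the VIPs come from the description above:

platform:
  vsphere:
    vCenter: vcenter.example.local   # placeholder
    datacenter: dc1                  # placeholder
    defaultDatastore: datastore1     # placeholder
    network: okd-vm-network          # placeholder vSphere port group
    apiVIP: 10.0.220.2               # VIPs as described above
    ingressVIP: 10.0.220.3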
References