openshift-metal3 / dev-scripts

Scripts to automate development/test setup for openshift integration with https://github.com/metal3-io/
Apache License 2.0

The Cluster creation fails with Error: could not contact Ironic API: context deadline exceeded #1586

Open rakeshk121 opened 11 months ago

rakeshk121 commented 11 months ago

Describe the bug

The cluster creation fails with the error:

level=debug msg=ironic_node_v1.openshift-master-host[2]: Still creating... [59m50s elapsed]
level=error msg=Error: could not contact Ironic API: context deadline exceeded
level=error msg=  with ironic_node_v1.openshift-master-host[1],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" {
level=error msg=Error: could not contact Ironic API: timeout reached

To Reproduce

I'm trying to set up OKD by following this commit from #1578: https://github.com/openshift-metal3/dev-scripts/pull/1578/commits/f9265103273200e2d75fa6c918765433dd85d0d7

git clone https://github.com/openshift-metal3/dev-scripts
cd dev-scripts
cp config_example.sh config_$USER.sh

I have set the following in config_core.sh:

export OPENSHIFT_RELEASE_IMAGE=registry.ci.openshift.org/origin/release:4.13.0-0.okd-2023-08-18-135805
export PULL_SECRET_FILE=pull_secret.json
export OPENSHIFT_RELEASE_TYPE=okd
export IP_STACK=v4

Expected/observed behavior

The cluster is created and can be accessed.

Additional context


Here is the log file: 06_create_cluster-2023-09-20-082531.log

bshephar commented 11 months ago

Hey, I'll try reproducing with the same release image and get back to you.

bshephar commented 11 months ago

This appears to have worked for me:

[m3@localhost dev-scripts]$ oc get clusterversion
NAME      VERSION                          AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.okd-2023-08-18-135805   True        False         58m     Cluster version is 4.13.0-0.okd-2023-08-18-135805
[m3@localhost dev-scripts]$ oc get bmh -A
NAMESPACE               NAME              STATE                    CONSUMER                      ONLINE   ERROR   AGE
openshift-machine-api   ostest-master-0   externally provisioned   ostest-fr5ld-master-0         true             92m
openshift-machine-api   ostest-master-1   externally provisioned   ostest-fr5ld-master-1         true             92m
openshift-machine-api   ostest-master-2   externally provisioned   ostest-fr5ld-master-2         true             92m
openshift-machine-api   ostest-worker-0   provisioned              ostest-fr5ld-worker-0-jbq5b   true             92m
openshift-machine-api   ostest-worker-1   provisioned              ostest-fr5ld-worker-0-84t6t   true             92m

I'll have to dig into the logs you provided to see if there are any clues about why yours is failing and mine isn't.

I'm setting:

[m3@localhost dev-scripts]$ grep -Ev '^#|^$' config_m3.sh
export OPENSHIFT_RELEASE_IMAGE=registry.ci.openshift.org/origin/release:4.13.0-0.okd-2023-08-18-135805
export PULL_SECRET_FILE=pull_secret.json
export OPENSHIFT_RELEASE_TYPE=okd
export IP_STACK=v4
export NUM_EXTRA_WORKERS=2

So we should be deploying the same thing here. I'm running on a CentOS9-Stream host:

[m3@localhost dev-scripts]$ cat /etc/redhat-release
CentOS Stream release 9

I see you're running Rocky 8.8:

❯ grep PRETTY_NAME 06_create_cluster-2023-09-20-082531.log
2023-09-20 08:25:31 +++(/etc/os-release:7): source(): PRETTY_NAME='Rocky Linux 8.8 (Green Obsidian)'

It would probably be helpful if you could provide logs from the bootstrap node, since that is where the Ironic container should be running: https://docs.okd.io/latest/support/troubleshooting/troubleshooting-installations.html#gathering-bootstrap-diagnostic-data_troubleshooting-installations

Check whether Ironic is listening on the bootstrap node:

sudo ss -tpnl | grep 6385

See if there are any restarting containers:

sudo podman ps -a

Check the logs of the Ironic container specifically:

sudo podman logs ironic

That's probably the best place to start trying to narrow things down.
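
The three checks above can be combined into one pass on the bootstrap node. This is only a sketch: port 6385 and the container name ironic come from the commands above, while the function names port_is_listening and check_ironic are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only, meant for the bootstrap node. Port 6385 and the container name
# "ironic" come from the commands above; both function names are hypothetical.

# Reads `ss -tpnl` output on stdin and succeeds if a socket listens on the port.
port_is_listening() {
    local port="$1"
    awk -v p=":${port}\$" '$1 == "LISTEN" && $4 ~ p { found = 1 } END { exit !found }'
}

# Runs the three checks in order and prints what it finds.
check_ironic() {
    if sudo ss -tpnl | port_is_listening 6385; then
        echo "Ironic API is listening on port 6385"
    else
        echo "Nothing is listening on port 6385"
    fi
    sudo podman ps -a                          # look for exited or restarting containers
    sudo podman logs ironic 2>&1 | tail -n 50  # last lines of the Ironic container log
}
```

After sourcing the file on the bootstrap node, calling check_ironic prints the listener status, the container list, and the tail of the Ironic log in one go.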

rakeshk121 commented 11 months ago

Thanks, @bshephar.

Yes, I'm setting variables that match your settings:

[core@nodea08 dev-scripts]$ grep -Ev '^#|^$' config_core.sh 
export OPENSHIFT_RELEASE_IMAGE=registry.ci.openshift.org/origin/release:4.13.0-0.okd-2023-08-18-135805
export PULL_SECRET_FILE=pull_secret.json
export OPENSHIFT_RELEASE_TYPE=okd
export NUM_EXTRA_WORKERS=2
export IP_STACK=v4

Ironic is listening on the bootstrap node:

[core@localhost ~]$ sudo ss -tpnl | grep 6385
LISTEN 0      128                *:6385             *:*    users:(("ironic",pid=6379,fd=5),("ironic",pid=6379,fd=4))  

I do not see any restarting containers:

[core@localhost ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                                                                  COMMAND               CREATED            STATUS                        PORTS       NAMES
1c51a4cb99f1  quay.io/openshift/okd-content@sha256:50ec87cbc91ded3b7cd41e54da9a21f0835cdfc36daac0bd1dca65737d70aa9f                        About an hour ago  Up About an hour                          dnsmasq
5fe7599f4302  quay.io/openshift/okd-content@sha256:a70e232022f49a883e1facb48690d6c16fdbdc79b2ff4fc807bf07825eb7c380  /bin/copy-metal -...  About an hour ago  Exited (0) About an hour ago              coreos-downloader
b17f707d9374  quay.io/openshift/okd-content@sha256:50ec87cbc91ded3b7cd41e54da9a21f0835cdfc36daac0bd1dca65737d70aa9f                        About an hour ago  Up About an hour                          httpd
ef6ba4a14d4e  quay.io/openshift/okd-content@sha256:ad2224900eabbb62bc83b7b356a0491bdb5798b57c2351f5df05e01a3b84ac90                        About an hour ago  Up About an hour                          image-customization
a8737f33d92a  quay.io/openshift/okd-content@sha256:50ec87cbc91ded3b7cd41e54da9a21f0835cdfc36daac0bd1dca65737d70aa9f                        About an hour ago  Up About an hour                          ironic
ffa193396b97  quay.io/openshift/okd-content@sha256:50ec87cbc91ded3b7cd41e54da9a21f0835cdfc36daac0bd1dca65737d70aa9f                        About an hour ago  Up About an hour                          ironic-inspector
d5f382e9cb36  quay.io/openshift/okd-content@sha256:50ec87cbc91ded3b7cd41e54da9a21f0835cdfc36daac0bd1dca65737d70aa9f                        About an hour ago  Up About an hour                          ironic-ramdisk-logs
f7847bdcf80c  quay.io/openshift/okd-content@sha256:1a245dbcc0684c6ca15c9ea67fbfa55073c5d672ea7b48f50c14c371b09de558  start --tear-down...  15 minutes ago     Up 15 minutes                         

Attaching the ironic logs here:

ironic.log

bshephar commented 11 months ago

Hey @rakeshk121.

OK, two thoughts:

1. Was this IP address reachable at all during the bootstrap process? 192.168.111.5

$ curl -s -o /dev/null -w "%{http_code}" https://192.168.111.5:6443 -k

I originally thought that maybe this just happened at the end of the deployment failure, but I think that VIP should still actually be available even if it does fail:

2023-09-20 09:26:01 E0920 09:26:01.513229  161368 memcache.go:238] couldn't get current server API group list: Get "https://api.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp 192.168.111.5:6443: connect: no route to host
2023-09-20 09:26:04 E0920 09:26:04.585280  161368 memcache.go:238] couldn't get current server API group list: Get "https://api.ostest.test.metalkube.org:6443/api?timeout=32s": dial tcp 192.168.111.5:6443: connect: no route to host
2. It looks like Ironic is working there. So, assuming that IP address is indeed reachable during the bootstrap process, we might need a must-gather to see if there is anything else happening on that node. If it's not reachable, then that is the first problem we need to solve.
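
A rough way to tell the "reachable" and "not reachable" cases apart is to interpret the status code printed by the curl command shown earlier. In the sketch below, the address 192.168.111.5 and port 6443 come from the log lines in this thread; the function names and the 5-second connect timeout are assumptions for illustration. It treats any HTTP status as "reachable", since even an authorization error proves the VIP answered.

```shell
#!/usr/bin/env bash
# Sketch only. The VIP and port come from the thread; the function names and
# the connect timeout are hypothetical choices.

# Maps the value of curl's %{http_code} to a short diagnosis.
classify_http_code() {
    case "$1" in
        000)     echo "unreachable" ;;  # curl prints 000 when no HTTP response arrived
        [1-5]??) echo "reachable" ;;    # any HTTP status means the VIP answered
        *)       echo "unknown" ;;
    esac
}

# Probes the API VIP once and prints the diagnosis.
probe_api_vip() {
    local vip="${1:-192.168.111.5}"
    local code
    code=$(curl -k -s -o /dev/null --connect-timeout 5 -w '%{http_code}' "https://${vip}:6443" || true)
    classify_http_code "${code:-000}"
}
```

Running probe_api_vip from the host while step 06 is in progress shows whether the "no route to host" errors in the log reflect a VIP that was never reachable.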

bdlink commented 2 months ago

I am having what seems to be a similar failure. config parameters:

export OPENSHIFT_RELEASE_IMAGE=quay.io/openshift/okd:4.15.0-0.okd-2024-03-10-010116
export PULL_SECRET_FILE=pull_secret.json
export OPENSHIFT_RELEASE_TYPE=okd
export IP_STACK=v4
export NETWORK_TYPE="OVNKubernetes"
export MASTER_DISK=90
export MASTER_VCPU=4
export NUM_WORKERS=0
export NUM_EXTRA_WORKERS=0

Using WORKING_DIR=/home/dev-scripts

I am running on a fresh install of CentOS Stream 9. After running make, step 06 times out after an hour. The bootstrap node comes up, and so does the bootstrap API.

sudo ss -tpnl | grep 6385 returns nothing.

sudo podman ps does not show restarting containers (inside or outside the bootstrap node).

sudo podman logs ironic returns: Error: no container with name or ID "ironic" found: no such container

The virtual machines ostest_master_0, ostest_master_1, and ostest_master_2 are shut down.

oc get bmh -A shows three machines online.

oc get po -n openshift-machine-api shows: No resources found in openshift-machine-api namespace

As I am using a current version of yq (v4.44.2), I had to remove the "y" on line 102 of 01_install_requirements.sh.

Log file: 06_create_cluster-2024-06-18-075053.log

bdlink commented 2 months ago

Looking at the use of yq in the bash scripts, I think the ones in utils.sh may not work with yq v4 (needing a period before []). This could be the cause of the issue. However, I am not an expert in yq.
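
One way to catch a yq mismatch before the scripts run is to check the major version up front. The sketch below is an assumption-laden illustration, not part of dev-scripts: the function names are made up, and it assumes `yq --version` prints a dotted version such as the v4.44.2 mentioned above. It only reads the version number and does not tell different yq implementations apart.

```shell
#!/usr/bin/env bash
# Sketch only. Function names are hypothetical, and the format of the
# `yq --version` output is an assumption.

# Reads `yq --version` output on stdin and prints the major version number.
yq_major_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1 | cut -d. -f1
}

# Fails with a message if the installed yq does not have the wanted major version.
require_yq_major() {
    local want="$1"
    local have
    have=$(yq --version 2>&1 | yq_major_version)
    if [ "$have" != "$want" ]; then
        echo "yq major version is '${have}', expected '${want}'" >&2
        return 1
    fi
}
```

For example, calling require_yq_major 4 near the top of a script stops early with a clear message instead of failing later inside a yq expression.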

On the bootstrap node there are fewer podman containers running than rakeshk121 had:

sudo podman ps -a
CONTAINER ID  IMAGE                                                                                                  COMMAND               CREATED        STATUS        PORTS       NAMES
3308f5f6df18  quay.io/openshift/okd-content@sha256:90eb227746e445d6e258d3c9aaccbbdeca517ffb0dcaf5b880c2bde4f74aaae2  /bin/rundnsmasq       11 hours ago   Up 11 hours               dnsmasq
e4b3a442040a  quay.io/openshift/okd-content@sha256:90eb227746e445d6e258d3c9aaccbbdeca517ffb0dcaf5b880c2bde4f74aaae2  /bin/runlogwatch....  11 hours ago   Up 11 hours               ironic-ramdisk-logs
80e66f86071b  quay.io/openshift/okd-content@sha256:90eb227746e445d6e258d3c9aaccbbdeca517ffb0dcaf5b880c2bde4f74aaae2  /bin/runhttpd         11 hours ago   Up 11 hours               httpd
49d9ecfa58df  quay.io/openshift/okd-content@sha256:9f3f8f11fd743a332f8328b774bed1854c5d5d058663eb122289191bcb0cee73  start --tear-down...  3 minutes ago  Up 3 minutes              cluster-bootstrap