openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0
1.44k stars 1.39k forks source link

IPI fails on iRMC servers due to inspect timeout #5504

Closed hs0210 closed 2 years ago

hs0210 commented 2 years ago

Version

$ openshift-baremetal-install version
openshift-baremetal-install 4.10.0-0.nightly-2021-12-20-231053
built from commit 9f37ece3620d14e48507f2afc5cf6a667ca2cef0
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:26f811bd37c593564093aa8e323cf81637c49a9527dbe6772c885fa9f55ab684
release architecture amd64

Platform:

baremetal

IPI

What happened?

Initial error phenomenon

Bootstrap VM does not start the ironic related containers.

What we tried

We manually pull image-customization-controller container, and after that the ironic related containers can successfully start.

Next error phenomenon

The master nodes got stuck in a restart loop and eventually the IPI failed due to inspect timeout.

The following is the error log from .openshift_install.log

time="2021-12-21T07:30:59Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2021-12-21T07:30:59Z" level=error
time="2021-12-21T07:30:59Z" level=error msg="  on ../../tmp/openshift-install-masters-4047236787/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2021-12-21T07:30:59Z" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"

The following is the error log from ServerView errorFormSeverView

What you expected to happen?

IPI successfully run on iRMC.

How to reproduce it (as minimally and precisely as possible)?

  1. Download the latest OCP nightly
  2. Fill in install-config
  3. Execute create cluster
  4. Pull image-customization-controller after bootstrap started.
$ mkdir ~/clusterconfigs
$ cp ~/install-config.yaml ~/clusterconfigs/
$ openshift-baremetal-install --dir ~/clusterconfigs create manifests
$ openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster
# wait for bootstrap starts, then execute the following command on Bootstrap VM
$ podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad

Anything else we need to know?

Before OCP 4.10.0-0.nightly-2021-12-14-083101, the IPI can run successfully on iRMC. We are not sure if this error is related to #5473

staebler commented 2 years ago

/label platform/baremetal

dtantsur commented 2 years ago

Could you maybe retry with 4.10.0-0.nightly-2021-12-21-130047 if possible just to be sure?

hs0210 commented 2 years ago

Could you maybe retry with 4.10.0-0.nightly-2021-12-21-130047 if possible just to be sure?

@dtantsur I tested with 4.10.0-0.nightly-2021-12-21-130047, the same error occurs.

hs0210 commented 2 years ago

I did a comparison test with the the branch of #5473 based on the release image of 4.10.0-0.nightly-2021-12-18-034942

So it seems that the above error is indeed caused by #5473

zaneb commented 2 years ago

Let's start at the beginning, and figure out why you have to manually pull the image-customization-controller image. Can you attach the output of journalctl -u ironic.service from the bootstrap?

hs0210 commented 2 years ago

@zaneb Bootstrap kept trying to download the image-customization, but failed. The following is the corresponding error log and the complete log can view ironic.service.log

Jan 05 08:11:29 bootstrap startironic.sh[18301]: + sudo podman run -d --net host --privileged --name image-customization --env DEPLOY_ISO=/shared/html/images/ironic-python-agent.iso --env DEPLOY_INITRD=/shared/html/images/ironic-python-agent.initramfs --env IRONIC_BASE_URL=http://192.168.30.201 --env 'IRONIC_RAMDISK_SSH_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain' --env IRONIC_AGENT_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f44d877415c320f301e07734247e7263b27b925d65fc79ecebee247dee691990 --env IRONIC_AGENT_PULL_SECRET=<IRONIC_AGENT_PULL_SECRET>
Jan 05 08:11:29 bootstrap sudo[21589]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/podman run -d --net host --privileged --name image-customization --env DEPLOY_ISO=/shared/html/images/ironic-python-agent.iso --env DEPLOY_INITRD=/shared/html/images/ironic-python-agent.initramfs --env IRONIC_BASE_URL=http://192.168.30.201 --env IRONIC_RAMDISK_SSH_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain --env IRONIC_AGENT_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f44d877415c320f301e07734247e7263b27b925d65fc79ecebee247dee691990 --env
Jan 05 08:11:29 bootstrap sudo[21589]:     root : (command continued) IRONIC_AGENT_PULL_SECRET=<IRONIC_AGENT_PULL_SECRET> --env PROVISIONING_INTERFACE=ens4 --env REGISTRIES_CONF_PATH=/tmp/containers/registries.conf --entrypoint ["/image-customization-server", "--nmstate-dir=/tmp/nmstate/", "--images-publish-addr=http://0.0.0.0:8084"] -v /tmp/nmstate/:/tmp/nmstate/ -v ironic:/shared:z -v /etc/containers:/tmp/containers:z quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad
Jan 05 08:11:29 bootstrap sudo[21589]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 05 08:11:29 bootstrap startironic.sh[18301]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad...
Jan 05 08:11:29 bootstrap startironic.sh[18301]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 127.0.0.1:53: no such host
Jan 05 08:11:29 bootstrap sudo[21589]: pam_unix(sudo:session): session closed for user root
hs0210 commented 2 years ago

@zaneb The image-customization-controller image pull failure can be fixed by #5513. And the second error mentioned above still exists.

zaneb commented 2 years ago

Cool, thanks.

There's a known bug with dual-stack deployments where the Provisioning VIP and API VIP are in different network families - that causes the host to be unable to download the rootfs. So that could be what's happening here (if so #5514 will be the fix).

However, the snippet of log output we can see on the console appears to indicate that we don't have a rootfs specified at all? That was added by openshift/ironic-image#243, which sets it in the environment variable IRONIC_KERNEL_PARAMS. Is it possible that the irmc driver is not using the [pxe]/pxe_append_params option config and we need to set it in [irmc]/kernel_append_params as well?

@dtantsur this looks suspicious: https://opendev.org/openstack/ironic/src/branch/master/ironic/conf/irmc.py#L110

zaneb commented 2 years ago

TBH I'd have expected https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L175 to have handled this.

hs0210 commented 2 years ago

Is it possible that the irmc driver is not using the [pxe]/pxe_append_params option config and we need to set it in [irmc]/kernel_append_params as well?

Thanks for your tip, we will investigate this.

dtantsur commented 2 years ago

we need to set it in [irmc]/kernel_append_params as well?

We do since the fallback to [pxe] is deprecated, but should work.

@hs0210 do you see this message in your logs? https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L172-L174

dtantsur commented 2 years ago

Created https://github.com/metal3-io/ironic-image/pull/342 to set kernel_append_params.

hs0210 commented 2 years ago

do you see this message in your logs? https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L172-L174

@dtantsur No, I didn't see this in ironic-conductor.log. But I saw the following message.

2022-01-06 08:20:50.029 1 WARNING oslo_config.cfg [-] Deprecated: Option "pxe_append_params" from group "pxe" is deprecated. Use option "kernel_append_params" from group "pxe".
zaneb commented 2 years ago

I wonder if we're actually getting '' instead of None at https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L171

dtantsur commented 2 years ago

I wonder if we're actually getting '' instead of None at https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L171

This should not be happening. The default for [irmc]kernel_append_params is explicitly None...

hs0210 commented 2 years ago

@zaneb @dtantsur , Hi, I tested metal3-io/ironic-image#342 . From the ironic.log, [irmc]kernel_append_params does have the corresponding value:

2022-01-11 07:36:46.355 1 DEBUG oslo_service.service [-] irmc.kernel_append_params      = nofb nomodeset vga=normal ipa-insecure=1 sshkey=<sshkey> ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs ignition.firstboot ignition.platform.id=metal log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2615

But from the ServerView, the same error still occurs. error

zaneb commented 2 years ago

Is it possible to capture the whole log so we can see what kernel params are actually making it to the host, and how it is failing to acquire the rootfs image?

openshift-ci[bot] commented 2 years ago

@rhjanders: Bugzilla bug referencing is only supported for Pull Requests, not issues.

In response to [this](https://github.com/openshift/installer/issues/5504#issuecomment-1010638120): >/bugzilla refresh Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rhjanders commented 2 years ago

^ wrong browser tab :)

hs0210 commented 2 years ago

Is it possible to capture the whole log so we can see what kernel params are actually making it to the host, and how it is failing to acquire the rootfs image?

@zaneb I tried video record, but the log output is jumpy, it is difficult to capture the whole log, I intercepted the following logs that seem to be relevant. Screenshot from 2022-01-12 21-55-24

zaneb commented 2 years ago

Unfortunately the parameters we're looking for appear after the ssh key, which is being truncated, so we can't actually tell if they're there or not :(

hs0210 commented 2 years ago

@zaneb Ah, the following is the log after ssh key. Screenshot from 2022-01-12 22-29-31 Screenshot from 2022-01-12 22-41-32

zaneb commented 2 years ago

I don't see the rest of that line appearing later, so I think it must have hit some sort of character limit and been truncated.

On the bright side, if we are getting the ssh key we should be getting the rootfs arg. We've seen other problems with trying to use the wrong address family to access the rootfs URL (i.e. IPv6 when the box can only do IPv4, or vice-versa: https://bugzilla.redhat.com/show_bug.cgi?id=2034527)

Also, since the ssh key is populated, you should be able to ssh into the host (as the user core) and get the full journal.

hs0210 commented 2 years ago

On the bright side, if we are getting the ssh key we should be getting the rootfs arg.

@zaneb From the follwoing log of ironic-conductor, I saw rootfs arg had been loaded to ramdisk from pxe[kernel_append_params] and this is the complete ironic-conductor.log

2022-01-12 11:19:31.072 1 DEBUG ironic.drivers.modules.pxe_base [req-f5c3af2a-035a-4868-af24-64c1a4796d1b bootstrap-user - - - -] 
Ramdisk (i)PXE boot for node 9c7bb22c-2a71-4f91-ae2e-80d6bee03516 has been prepared with kernel params {'deployment_aki_path': '9c7bb22c-2a71-4f91-ae2e-80d6bee03516/deploy_kernel',
'deployment_ari_path': '9c7bb22c-2a71-4f91-ae2e-80d6bee03516/deploy_ramdisk', 'aki_path': 'no_kernel', 'ari_path': 'no_ramdisk', 'pxe_append_params': 'nofb nomodeset vga=normal ipa-insecure=1 
sshkey="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain" 
ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs ignition.firstboot 
ignition.platform.id=metal ipa-debug=1 ipa-inspection-collectors=default,extra-hardware,logs ipa-enable-vlan-interfaces=all 
ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=http://192.168.20.2:5050/v1/continue ipa-api-url=http://192.168.20.2:6385 ipa-global-request-id=req-f5c3af2a-035a-4868-af24-64c1a4796d1b', 
'tftp_server': '192.168.20.2', 'ipxe_timeout': 0, 'ipa-api-url': 'http://192.168.20.2:6385'} prepare_ramdisk /usr/lib/python3.6/site-packages/ironic/drivers/modules/pxe_base.py:224

Also, since the ssh key is populated, you should be able to ssh into the host (as the user core) and get the full journal.

We logged into the host and found the relevant logs, but unfortunately, some parameters are still omitted.

Jan 13 07:59:15 localhost dracut-cmdline[382]: Using kernel command line parameters: ip=auto rd.driver.pre=dm_multipath BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/vmlinuz-4.18.0-305.28.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ignition.firstboot ostree=/ostree/boot.1/rhcos/b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/0
Jan 13 07:59:15 localhost systemd-journald[344]: Missed 7 kernel messages
hs0210 commented 2 years ago

@zaneb @dtantsur Hi, I noticed that openshift/ironic-image is already several commits behind metal3-io/ironic-image, I'm not sure if there are some necessary commits that haven't been backported to openshift/ironic-image, could you help take a look?

zaneb commented 2 years ago

ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs

These options are consistent, so it doesn't look like https://bugzilla.redhat.com/show_bug.cgi?id=2034527

The rootfs is specified (and looks correct) so it isn't an issue with not passing the full kernel args for this driver.

I think we'll need to see the whole journal. However, the log you pasted shows that the actual kernel arguments bear no relation to the ones passed by ironic, e.g. ignition.platform.id=qemu. I strongly suspect it's booting into an old image from disk, not the one we are trying to PXE boot it from.

hs0210 commented 2 years ago

However, the log you pasted shows that the actual kernel arguments bear no relation to the ones passed by ironic, e.g. ignition.platform.id=qemu. I strongly suspect it's booting into an old image from disk, not the one we are trying to PXE boot it from.

Very sorry, I pasted a wrong log, the ignition.platform.id=qemu was from bootstrap's journal. The msater node had not been assigned an IP, so we can't access it with ssh. We can just obtain logs from ServerView, and the log below is the part of the journalctl command output we intercepted from ServerView. MicrosoftTeams-image

IMO, the current situation is this:

  1. From the ironic-conductor log, [pxe]pxe_append_params had been successfully set.
  2. irmc indeed boot from pxe.
  3. irmc kernel module didn't load rootfs related parameters successfully.
hs0210 commented 2 years ago

@zaneb @dtantsur We found out that IRONIC_KERNEL_PARAMS is not set in the httpd container, since coreos-download runs after httpd, the ROOTFS_FILE is not exist, so IRONIC_KERNEL_PARAMS wouldn't be set, resulting in no rootfs parameter available when irmc boot.

We sent #5541 fix this, PTAL.

zaneb commented 2 years ago

@zaneb @dtantsur We found out that IRONIC_KERNEL_PARAMS is not set in the httpd container, since coreos-download runs after httpd, the ROOTFS_FILE is not exist, so IRONIC_KERNEL_PARAMS wouldn't be set, resulting in no rootfs parameter available when irmc boot.

We sent #5541 fix this, PTAL.

Thanks, that fix looks good. It will need to be linked to a bugzilla.

How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?

hs0210 commented 2 years ago

Thanks, that fix looks good. It will need to be linked to a bugzilla.

@zaneb We are preparing the bugzilla.

How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?

Ah, very sorry, we don't have a VM test environment.

After fixing this issue, we continue do IPI on irmc and a new error occurred, we have sent issue #5542 to track it, PTAL.

zaneb commented 2 years ago

How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?

Ah, very sorry, we don't have a VM test environment.

What I mean is, without your patch it has been working for us in CI and in QE's testing. Yet your patch would appear to be essential to getting the right parameters into inspector.ipxe. So is there something different about the irmc driver that makes it required only there?

hs0210 commented 2 years ago

@zaneb According to derekhiggins's comment in #5541, your thought should be right. There's difference between your environment and irmc. The ironic in your test environment should create the pxe config successfully and then use dualboot.ipxe. But the pxe config(pxelinux.cfg/90-1b-0e-e6-7b-88) was not found when irmc boot, so irmc used inspector.ipxe. see details at zhou's comment in #5541