Closed hs0210 closed 2 years ago
/label platform/baremetal
Could you maybe retry with 4.10.0-0.nightly-2021-12-21-130047 if possible just to be sure?
Could you maybe retry with 4.10.0-0.nightly-2021-12-21-130047 if possible just to be sure?
@dtantsur I tested with 4.10.0-0.nightly-2021-12-21-130047, the same error occurs.
I did a comparison test with the the branch of #5473 based on the release image of 4.10.0-0.nightly-2021-12-18-034942
So it seems that the above error is indeed caused by #5473
Let's start at the beginning, and figure out why you have to manually pull the image-customization-controller image. Can you attach the output of journalctl -u ironic.service
from the bootstrap?
@zaneb Bootstrap kept trying to download the image-customization
, but failed. The following is the corresponding error log and the complete log can view ironic.service.log
Jan 05 08:11:29 bootstrap startironic.sh[18301]: + sudo podman run -d --net host --privileged --name image-customization --env DEPLOY_ISO=/shared/html/images/ironic-python-agent.iso --env DEPLOY_INITRD=/shared/html/images/ironic-python-agent.initramfs --env IRONIC_BASE_URL=http://192.168.30.201 --env 'IRONIC_RAMDISK_SSH_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain' --env IRONIC_AGENT_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f44d877415c320f301e07734247e7263b27b925d65fc79ecebee247dee691990 --env IRONIC_AGENT_PULL_SECRET=<IRONIC_AGENT_PULL_SECRET>
Jan 05 08:11:29 bootstrap sudo[21589]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/podman run -d --net host --privileged --name image-customization --env DEPLOY_ISO=/shared/html/images/ironic-python-agent.iso --env DEPLOY_INITRD=/shared/html/images/ironic-python-agent.initramfs --env IRONIC_BASE_URL=http://192.168.30.201 --env IRONIC_RAMDISK_SSH_KEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain --env IRONIC_AGENT_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f44d877415c320f301e07734247e7263b27b925d65fc79ecebee247dee691990 --env
Jan 05 08:11:29 bootstrap sudo[21589]: root : (command continued) IRONIC_AGENT_PULL_SECRET=<IRONIC_AGENT_PULL_SECRET> --env PROVISIONING_INTERFACE=ens4 --env REGISTRIES_CONF_PATH=/tmp/containers/registries.conf --entrypoint ["/image-customization-server", "--nmstate-dir=/tmp/nmstate/", "--images-publish-addr=http://0.0.0.0:8084"] -v /tmp/nmstate/:/tmp/nmstate/ -v ironic:/shared:z -v /etc/containers:/tmp/containers:z quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad
Jan 05 08:11:29 bootstrap sudo[21589]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 05 08:11:29 bootstrap startironic.sh[18301]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad...
Jan 05 08:11:29 bootstrap startironic.sh[18301]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f15e656e81f2b54f98c4e1c0eb376eacc63bbb7095c0cf743bdcb8fa3b0a6ad: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 127.0.0.1:53: no such host
Jan 05 08:11:29 bootstrap sudo[21589]: pam_unix(sudo:session): session closed for user root
@zaneb The image-customization-controller image pull failure can be fixed by #5513. And the second error mentioned above still exists.
Cool, thanks.
There's a known bug with dual-stack deployments where the Provisioning VIP and API VIP are in different network families - that causes the host to be unable to download the rootfs. So that could be what's happening here (if so #5514 will be the fix).
However, the snippet of log output we can see on the console appears to indicate that we don't have a rootfs specified at all? That was added by openshift/ironic-image#243, which sets it in the environment variable IRONIC_KERNEL_PARAMS
. Is it possible that the irmc driver is not using the [pxe]/pxe_append_params
option config and we need to set it in [irmc]/kernel_append_params
as well?
@dtantsur this looks suspicious: https://opendev.org/openstack/ironic/src/branch/master/ironic/conf/irmc.py#L110
TBH I'd have expected https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L175 to have handled this.
Is it possible that the irmc driver is not using the [pxe]/pxe_append_params option config and we need to set it in [irmc]/kernel_append_params as well?
Thanks for your tip, we will investigate this.
we need to set it in [irmc]/kernel_append_params as well?
We do since the fallback to [pxe] is deprecated, but should work.
@hs0210 do you see this message in your logs? https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L172-L174
Created https://github.com/metal3-io/ironic-image/pull/342 to set kernel_append_params.
do you see this message in your logs? https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L172-L174
@dtantsur No, I didn't see this in ironic-conductor.log. But I saw the following message.
2022-01-06 08:20:50.029 1 WARNING oslo_config.cfg [-] Deprecated: Option "pxe_append_params" from group "pxe" is deprecated. Use option "kernel_append_params" from group "pxe".
I wonder if we're actually getting ''
instead of None
at https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L171
I wonder if we're actually getting '' instead of None at https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/irmc/boot.py#L171
This should not be happening. The default for [irmc]kernel_append_params
is explicitly None
...
@zaneb @dtantsur , Hi, I tested metal3-io/ironic-image#342 . From the ironic.log, [irmc]kernel_append_params
does have the corresponding value:
2022-01-11 07:36:46.355 1 DEBUG oslo_service.service [-] irmc.kernel_append_params = nofb nomodeset vga=normal ipa-insecure=1 sshkey=<sshkey> ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs ignition.firstboot ignition.platform.id=metal log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2615
But from the ServerView, the same error still occurs.
Is it possible to capture the whole log so we can see what kernel params are actually making it to the host, and how it is failing to acquire the rootfs image?
@rhjanders: Bugzilla bug referencing is only supported for Pull Requests, not issues.
^ wrong browser tab :)
Is it possible to capture the whole log so we can see what kernel params are actually making it to the host, and how it is failing to acquire the rootfs image?
@zaneb I tried video record, but the log output is jumpy, it is difficult to capture the whole log, I intercepted the following logs that seem to be relevant.
Unfortunately the parameters we're looking for appear after the ssh key, which is being truncated, so we can't actually tell if they're there or not :(
@zaneb Ah, the following is the log after ssh key.
I don't see the rest of that line appearing later, so I think it must have hit some sort of character limit and been truncated.
On the bright side, if we are getting the ssh key we should be getting the rootfs arg. We've seen other problems with trying to use the wrong address family to access the rootfs URL (i.e. IPv6 when the box can only do IPv4, or vice-versa: https://bugzilla.redhat.com/show_bug.cgi?id=2034527)
Also, since the ssh key is populated, you should be able to ssh into the host (as the user core
) and get the full journal.
On the bright side, if we are getting the ssh key we should be getting the rootfs arg.
@zaneb From the follwoing log of ironic-conductor, I saw rootfs arg had been loaded to ramdisk from pxe[kernel_append_params] and this is the complete ironic-conductor.log
2022-01-12 11:19:31.072 1 DEBUG ironic.drivers.modules.pxe_base [req-f5c3af2a-035a-4868-af24-64c1a4796d1b bootstrap-user - - - -]
Ramdisk (i)PXE boot for node 9c7bb22c-2a71-4f91-ae2e-80d6bee03516 has been prepared with kernel params {'deployment_aki_path': '9c7bb22c-2a71-4f91-ae2e-80d6bee03516/deploy_kernel',
'deployment_ari_path': '9c7bb22c-2a71-4f91-ae2e-80d6bee03516/deploy_ramdisk', 'aki_path': 'no_kernel', 'ari_path': 'no_ramdisk', 'pxe_append_params': 'nofb nomodeset vga=normal ipa-insecure=1
sshkey="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIaVOPtr+CGtxC2JqzMiRzY+TVRrCMJn1OiQmA3VMTlq kni@localhost.localdomain"
ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs ignition.firstboot
ignition.platform.id=metal ipa-debug=1 ipa-inspection-collectors=default,extra-hardware,logs ipa-enable-vlan-interfaces=all
ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=http://192.168.20.2:5050/v1/continue ipa-api-url=http://192.168.20.2:6385 ipa-global-request-id=req-f5c3af2a-035a-4868-af24-64c1a4796d1b',
'tftp_server': '192.168.20.2', 'ipxe_timeout': 0, 'ipa-api-url': 'http://192.168.20.2:6385'} prepare_ramdisk /usr/lib/python3.6/site-packages/ironic/drivers/modules/pxe_base.py:224
Also, since the ssh key is populated, you should be able to ssh into the host (as the user core) and get the full journal.
We logged into the host and found the relevant logs, but unfortunately, some parameters are still omitted.
Jan 13 07:59:15 localhost dracut-cmdline[382]: Using kernel command line parameters: ip=auto rd.driver.pre=dm_multipath BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/vmlinuz-4.18.0-305.28.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ignition.firstboot ostree=/ostree/boot.1/rhcos/b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/0
Jan 13 07:59:15 localhost systemd-journald[344]: Missed 7 kernel messages
@zaneb @dtantsur Hi, I noticed that openshift/ironic-image is already several commits behind metal3-io/ironic-image, I'm not sure if there are some necessary commits that haven't been backported to openshift/ironic-image, could you help take a look?
ip=dhcp coreos.live.rootfs_url=http://192.168.20.2:80/images/ironic-python-agent.rootfs
These options are consistent, so it doesn't look like https://bugzilla.redhat.com/show_bug.cgi?id=2034527
The rootfs is specified (and looks correct) so it isn't an issue with not passing the full kernel args for this driver.
I think we'll need to see the whole journal. However, the log you pasted shows that the actual kernel arguments bear no relation to the ones passed by ironic, e.g. ignition.platform.id=qemu
. I strongly suspect it's booting into an old image from disk, not the one we are trying to PXE boot it from.
However, the log you pasted shows that the actual kernel arguments bear no relation to the ones passed by ironic, e.g. ignition.platform.id=qemu. I strongly suspect it's booting into an old image from disk, not the one we are trying to PXE boot it from.
Very sorry, I pasted a wrong log, the ignition.platform.id=qemu
was from bootstrap's journal. The msater node had not been assigned an IP, so we can't access it with ssh. We can just obtain logs from ServerView, and the log below is the part of the journalctl
command output we intercepted from ServerView.
IMO, the current situation is this:
[pxe]pxe_append_params
had been successfully set.@zaneb @dtantsur We found out that IRONIC_KERNEL_PARAMS
is not set in the httpd
container, since coreos-download
runs after httpd
, the ROOTFS_FILE is not exist, so IRONIC_KERNEL_PARAMS wouldn't be set, resulting in no rootfs parameter available when irmc boot.
We sent #5541 fix this, PTAL.
@zaneb @dtantsur We found out that
IRONIC_KERNEL_PARAMS
is not set in thehttpd
container, sincecoreos-download
runs afterhttpd
, the ROOTFS_FILE is not exist, so IRONIC_KERNEL_PARAMS wouldn't be set, resulting in no rootfs parameter available when irmc boot.We sent #5541 fix this, PTAL.
Thanks, that fix looks good. It will need to be linked to a bugzilla.
How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?
Thanks, that fix looks good. It will need to be linked to a bugzilla.
@zaneb We are preparing the bugzilla.
How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?
Ah, very sorry, we don't have a VM test environment.
After fixing this issue, we continue do IPI on irmc and a new error occurred, we have sent issue #5542 to track it, PTAL.
How does it work on VMs? Is it possible that some (but only some) drivers use the kernel params from the ironic config file for inspection in preference to those in the inspector.ipxe file?
Ah, very sorry, we don't have a VM test environment.
What I mean is, without your patch it has been working for us in CI and in QE's testing. Yet your patch would appear to be essential to getting the right parameters into inspector.ipxe. So is there something different about the irmc driver that makes it required only there?
@zaneb According to derekhiggins's comment in #5541, your thought should be right. There's difference between your environment and irmc. The ironic in your test environment should create the pxe config successfully and then use dualboot.ipxe. But the pxe config(pxelinux.cfg/90-1b-0e-e6-7b-88) was not found when irmc boot, so irmc used inspector.ipxe. see details at zhou's comment in #5541
Version
Platform:
baremetal
IPI
What happened?
Initial error phenomenon
Bootstrap VM does not start the ironic related containers.
What we tried
We manually pull
image-customization-controller
container, and after that the ironic related containers can successfully start.Next error phenomenon
The master nodes got stuck in a restart loop and eventually the IPI failed due to inspect timeout.
The following is the error log from .openshift_install.log
The following is the error log from ServerView
What you expected to happen?
IPI successfully run on iRMC.
How to reproduce it (as minimally and precisely as possible)?
create cluster
image-customization-controller
after bootstrap started.Anything else we need to know?
Before OCP 4.10.0-0.nightly-2021-12-14-083101, the IPI can run successfully on iRMC. We are not sure if this error is related to #5473