okd-project / okd-scos

OKD/SCOS releases

IPI on bare metal fails for OKD SCOS 4.12 maybe because of broken DNS settings on masters #7

Closed: JM1 closed this issue 9 months ago

JM1 commented 1 year ago

When doing IPI on bare metal servers with OKD SCOS 4.12, the deployment fails with:

...
ironic_deployment.openshift-master-deployment[1]: Still creating... [1m30s elapsed]
ironic_deployment.openshift-master-deployment[0]: Creation complete after 1m31s [id=bc56bfb9-8115-4a55-b56a-6c21e62a0132]
ironic_deployment.openshift-master-deployment[2]: Creation complete after 1m31s [id=18e4f18f-a914-4626-b2c9-eacf0e1a134f]
ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m31s [id=3d4e2a00-57e6-4edb-80a9-6365dfe66633]

Apply complete! Resources: 6 added, 0 changed, 0 destroyed.
...
[INFO] running Terraform command: /home/ansible/clusterconfigs/terraform/bin/terraform output -no-color -json
...
OpenShift Installer 4.12.0-0.okd-scos-2023-02-22-083106
Built from commit 7db3e5c61b5dfdd47d772e7042f549202abcbb14
Waiting up to 20m0s (until 3:33PM) for the Kubernetes API at https://api.okd-ipi.home.arpa:6443...
API v1.25.0-2655+18eadcaadf0be7-dirty up
...
Waiting up to 1h0m0s (until 4:13PM) for bootstrapping to complete...
...
Reusing previously-fetched Install Config
Attempted to gather debug logs after installation failure: must provide bootstrap host address
Bootstrap failed to complete: timed out waiting for the condition

All three control plane nodes have invalid DNS settings: instead of using the DNS nameserver(s) provided by the local DHCP server, each master has seemingly random, non-local DNS nameservers listed in /etc/resolv.conf:

$ ssh core@cp0 sudo cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
search okd-ipi.home.arpa
nameserver 192.168.158.29
nameserver 16.182.227.198

$ ssh core@cp1 sudo cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
search okd-ipi.home.arpa
nameserver 192.168.158.30
nameserver 128.154.207.62

$ ssh core@cp2 sudo cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
search okd-ipi.home.arpa
nameserver 192.168.158.31
nameserver 192.23.149.53

The nameservers from the 192.168.158.0/24 network listed above are the IP addresses of the nodes themselves, obtained via DHCP. But 16.182.227.198, 128.154.207.62 and 192.23.149.53 look wrong: the local DNS nameserver provided by the DHCP server is 192.168.158.26.
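
For reference, a quick way to compare what the DHCP lease actually handed out with what ended up in /etc/resolv.conf (a sketch; nmcli ships with NetworkManager, and the dispatcher scripts log under the NetworkManager-dispatcher unit):

$ ssh core@cp0
# Nameserver(s) the DHCP lease actually provided for the active connection
$ nmcli -f IP4.DNS device show
# What the resolv prepender dispatcher script did on this boot
$ sudo journalctl -b -u NetworkManager-dispatcher | grep -i resolv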

Resolving registry.ci.openshift.org fails due to those broken nameservers:

$ crictl logs "$(crictl ps --name '^coredns$' --output json | jq -r '.containers[] | .id')"
...
[ERROR] plugin/errors: 2 registry.ci.openshift.org. A: read udp 192.168.158.29:58081->16.182.227.198:53: i/o timeout
[ERROR] plugin/errors: 2 registry.ci.openshift.org. AAAA: read udp 192.168.158.29:56732->16.182.227.198:53: i/o timeout
[ERROR] plugin/errors: 2 registry.ci.openshift.org. AAAA: read udp 192.168.158.29:42462->16.182.227.198:53: i/o timeout
[ERROR] plugin/errors: 2 registry.ci.openshift.org. A: read udp 192.168.158.29:40739->16.182.227.198:53: i/o timeout
...
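
To confirm it is purely a nameserver problem, resolution can be tested directly against both servers (a sketch; it assumes dig is available, e.g. inside a toolbox container, since it is not part of the base image):

# Against the nameserver handed out by DHCP (expected to work)
$ dig +short @192.168.158.26 registry.ci.openshift.org
# Against one of the bogus entries from /etc/resolv.conf (times out)
$ dig +time=3 +tries=1 @16.182.227.198 registry.ci.openshift.org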

Also systemd-sysusers.service fails on all master nodes (but that might be a red herring):

$ journalctl -u systemd-sysusers.service
systemd[1]: systemd-sysusers.service: Deactivated successfully.
systemd[1]: Stopped Create System Users.
systemd[1]: Starting Create System Users...
systemd-sysusers[753]: Creating group 'sgx' with GID 991.
systemd-sysusers[753]: Creating group 'systemd-oom' with GID 990.
systemd-sysusers[753]: Creating user 'systemd-oom' (systemd Userspace OOM Killer) with UID 990 and GID 990.
systemd-sysusers[753]: /etc/gshadow: Group "sgx" already exists.
systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Create System Users.
systemd[1]: Starting Create System Users...
systemd-sysusers[792]: Creating group 'sgx' with GID 991.
systemd-sysusers[792]: Creating group 'systemd-oom' with GID 990.
systemd-sysusers[792]: Creating user 'systemd-oom' (systemd Userspace OOM Killer) with UID 990 and GID 990.
systemd-sysusers[792]: /etc/gshadow: Group "sgx" already exists.
systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Create System Users.
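
As noted, this may be a red herring, but the failure reads like an inconsistency between /etc/group and /etc/gshadow: sysusers is creating the sgx group, yet /etc/gshadow already has an entry for it. A minimal check on a node (a sketch):

# Compare the two group databases for the entry sysusers trips over
$ sudo grep '^sgx:' /etc/group /etc/gshadow
# grpck reports group/gshadow inconsistencies (read-only mode)
$ sudo grpck -r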

The same environment and install-config.yaml work fine with OCP RHCOS 4.12, though.

Version

$ openshift-install version
openshift-install 4.12.0-0.okd-scos-2023-02-22-083106
built from commit 7db3e5c61b5dfdd47d772e7042f549202abcbb14
release image registry.ci.openshift.org/origin/release-scos@sha256:eb8e2273a2bc3ea1dd835e131aaab64d50a96280770ff9ca8f6f306f571fca95
release architecture amd64

Platform

How reproducible

With Docker Compose and enough RAM, you can reproduce this bug using Ansible hosts lvrt-lcl-session-srv-4* from the Ansible collection jm1.cloudy. They deploy an installer-provisioned OKD cluster based on SCOS inside a Docker container, using QEMU/KVM virtual machines to simulate the bare-metal servers.
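
For anyone unfamiliar with the collection, a very rough sketch of the setup; the inventory and playbook paths below are placeholders, the collection's README is authoritative:

# Install the collection from Ansible Galaxy
$ ansible-galaxy collection install jm1.cloudy
# Run it against the OKD SCOS IPI hosts (inventory/playbook paths are placeholders)
$ ansible-playbook -i inventory/ playbook_site.yml --limit 'lvrt-lcl-session-srv-4*'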

The example install-config.yaml used for IPI works fine with OpenShift 4.12. This README.md has instructions on what to change to deploy OpenShift instead of OKD.

This bug is 100% reproducible.

Log bundle

log-bundle-20230222210622.tar.gz

Goose29 commented 1 year ago

I see the same with regard to the systemd-sysusers unit on a fresh installation.

I have created an issue for this here: https://github.com/okd-project/okd-scos/issues/9

solacelost commented 1 year ago

I see the same behavior - failed systemd-sysusers and subsequent inexplicable DNS errors - on a bare metal (actual, not virtualized) cluster update from 4.12.0-0.okd-scos-2023-03-23-213604 to 4.12.0-0.okd-scos-2023-04-14-052931. I had to roll back the update because my MCPs wouldn't progress and I couldn't force them onto the new image/rendered config. If there is any specific log that would be useful, let me know.
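
For anyone triaging the same stuck rollout, the usual places to look are the pools, the nodes, and the machine-config-daemon pods (a generic sketch, nothing specific to this cluster):

# Pool status: which pools are degraded and which rendered config they are stuck on
$ oc get machineconfigpools
$ oc describe mcp master
# Node readiness and the machine-config-daemon pods performing the rollout
$ oc get nodes
$ oc -n openshift-machine-config-operator get pods -o wide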

JM1 commented 1 year ago

FYI quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-06-23-041457 deploys just fine. Let me try an older release again...

JM1 commented 1 year ago

quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931 works as well, hmm...
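
For anyone who wants to repeat these per-release tests, a minimal sketch of one way to point the installer at a specific release payload (adjust the image tag and --dir as needed):

# Fetch the matching installer binary from the release image
$ oc adm release extract --command=openshift-install --to=. quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931
# Or point an existing installer at the payload via the override variable
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931 \
    ./openshift-install create cluster --dir clusterconfigs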

LorbusChris commented 1 year ago

The sysusers issue is likely not related here.

Have you checked for SELinux failures on the nodes? We have seen issues in OKD/FCOS where NetworkManager wasn't able to run its dispatcher scripts, resulting in similar symptoms.
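
A quick way to check for such denials on a node (a sketch; ausearch needs the audit userspace tools, the journal grep works on a stock node):

# AVC denials recorded since boot (requires audit tools)
$ sudo ausearch -m avc -ts boot
# Fallback: search the journal for denials
$ sudo journalctl -b | grep -i 'avc.*denied'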

JM1 commented 1 year ago

Went down the rabbit hole and tested some older releases. With quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-02-14-060109 all master nodes fail to come up: they boot into RHCOS 8 [0], then reboot into SCOS 9 [1], where machine-config-daemon endlessly tries and fails to pull quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21 with [2]:

Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:47085->[::1]:53: read: connection refused

Unlike with the 4.12.0-0.okd-scos-2023-02-22-083106 build from my initial post, /etc/resolv.conf is correct on the master nodes [3]. There is also no audit message indicating a SELinux-related failure.

[0]

localhost kernel: Linux version 4.18.0-372.19.1.el8_6.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Mon Jul 18 11:14:02 EDT 2022
localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-20cabae0b778d002086bc1306661abef32aa3e95bbf1dd357fb1dbe44ee174d5/vmlinuz-4.18.0-372.19.1.el8_6.x86_64 ...

[1]

localhost kernel: Linux version 5.14.0-252.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-35.el9) #1 SMP PREEMPT_DYNAMIC Wed Feb 1 13:25:18 UTC 2023
localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-2d820677a485683f369f4acf2b5db3e8ef07ac74036c2ca402180e3b039c7650/vmlinuz-5.14.0-252.el9.x86_64 ignition.platform.id=metal console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/2d820677a485683f369f4acf2b5db3e8ef07ac74036c2ca402180e3>

[2]

cp0 machine-config-daemon[***]: Pulling manifest: ostree-unverified-image:docker://quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21
cp0 rpm-ostree[***]: Fetching ostree-unverified-image:docker://quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21
cp0 rpm-ostree[***]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:47085->[::1]:53: read: connection refused
cp0 rpm-ostree[***]: Unlocked sysroot
cp0 rpm-ostree[***]: Process [pid: 30122 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
cp0 rpm-ostree[***]: client(id:cli dbus:1.1629 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
cp0 rpm-ostree[***]: In idle state; will auto-exit in 64 seconds

[3]

# Generated by KNI resolv prepender NM dispatcher script
search ocp-ipi.home.arpa
nameserver 192.168.158.29
nameserver 192.168.158.26
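
The "dial tcp: lookup quay.io on [::1]:53" part is interesting because excerpt [3] shows a sane /etc/resolv.conf; it suggests that whatever performed the lookup went through the loopback stub resolver with nothing listening there. A quick check on an affected node (a sketch, nothing SCOS-specific):

# What resolv.conf points at right now
$ sudo cat /etc/resolv.conf
# Is anything listening on the loopback DNS port?
$ sudo ss -ulpn 'sport = :53'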

JM1 commented 1 year ago

With quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-03-23-213604 both master nodes and worker nodes randomly fail to join the cluster, e.g. on my first run a master node (cp2) was missing, and on a second run a master node (cp2) and a worker node (w1) were missing. This is also different behaviour from what I observed in my initial post with 4.12.0-0.okd-scos-2023-02-22-083106.
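
When nodes silently fail to join like this, the usual first checks are pending CSRs and the kubelet log on the missing node (a generic sketch, not specific to these builds):

# Certificate signing requests from nodes that never made it into the cluster
$ oc get csr --sort-by=.metadata.creationTimestamp
$ oc adm certificate approve <csr-name>
# On the missing node: did kubelet start, and can it reach the API?
$ ssh core@cp2 sudo journalctl -b -u kubelet --no-pager | tail -n 50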

For me, it looks like quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931 is the first image where IPI works reliably. Personally I am tempted to close this bug report, but @solacelost also had these DNS issues. @solacelost, could you please test with more recent releases and see whether those errors are fixed for you as well?