Closed: JM1 closed this issue 9 months ago
I see the same with regard to the systemd-sysusers unit on a fresh installation. I have created an issue for this here: https://github.com/okd-project/okd-scos/issues/9
I see the same behavior - failed systemd-sysusers and subsequent inexplicable DNS errors - on a bare metal (actual, not virtualized) cluster update from 4.12.0-0.okd-scos-2023-03-23-213604 to 4.12.0-0.okd-scos-2023-04-14-052931. Had to roll back the update as my MCPs wouldn't progress and I couldn't force them to the new image/rendered config. If there is any specific log that would be useful, let me know.
FYI quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-06-23-041457 deploys just fine. Let me try an older release again...
quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931 works as well, hmm...
The sysusers issue is likely not related here.
Have you checked for SELinux failures on the nodes? We've seen issues in OKD/FCOS where NetworkManager wasn't able to run its dispatcher scripts, resulting in similar symptoms.
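For reference, a minimal sketch of what such a check could look like. The audit line below is a fabricated example in the usual AVC format, not taken from these nodes; on a real node you would run `ausearch -m avc -ts boot` (or grep the journal) instead of grepping a sample string:

```shell
# Hypothetical audit-log excerpt in the standard AVC format (NOT from these nodes).
audit_excerpt='type=SERVICE_START msg=audit(1677100000.000:100): unit=NetworkManager
type=AVC msg=audit(1677100000.100:101): avc:  denied  { execute } for pid=1234 comm="nm-dispatcher"'

# Count denials that mention the NetworkManager dispatcher.
printf '%s\n' "$audit_excerpt" | grep 'avc:  denied' | grep -c 'nm-dispatcher'
```

A non-zero count would point at SELinux blocking the dispatcher scripts that write `/etc/resolv.conf`.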
Went down the rabbit hole and tested some older releases. With quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-02-14-060109, all master nodes fail to come up: they boot into RHCOS 8 [0], then reboot into SCOS 9 [1], where machine-config-daemon endlessly tries and fails to pull quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21 with [2]:
Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:47085->[::1]:53: read: connection refused
Unlike with the 4.12.0-0.okd-scos-2023-02-22-083106 build from my initial post, /etc/resolv.conf is correct on the master nodes [3]. There is also no audit message indicating a SELinux-related failure.
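Note that the error message itself pinpoints the resolver: the lookup goes to [::1]:53, i.e. the node is querying localhost (where nothing answers on port 53) rather than any DHCP-provided nameserver. A small sketch of pulling that address out of the log line:

```shell
# The rpm-ostree error from [2]; sed extracts the resolver address the lookup went to.
err='dial tcp: lookup quay.io on [::1]:53: read udp [::1]:47085->[::1]:53: read: connection refused'
printf '%s\n' "$err" | sed -n 's/.*lookup [^ ]* on \(\[[^]]*\]:[0-9]*\):.*/\1/p'
```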
[0]
localhost kernel: Linux version 4.18.0-372.19.1.el8_6.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Mon Jul 18 11:14:02 EDT 2022
localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-20cabae0b778d002086bc1306661abef32aa3e95bbf1dd357fb1dbe44ee174d5/vmlinuz-4.18.0-372.19.1.el8_6.x86_64 ...
[1]
localhost kernel: Linux version 5.14.0-252.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-35.el9) #1 SMP PREEMPT_DYNAMIC Wed Feb 1 13:25:18 UTC 2023
localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-2d820677a485683f369f4acf2b5db3e8ef07ac74036c2ca402180e3b039c7650/vmlinuz-5.14.0-252.el9.x86_64 ignition.platform.id=metal console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/2d820677a485683f369f4acf2b5db3e8ef07ac74036c2ca402180e3>
[2]
cp0 machine-config-daemon[***]: Pulling manifest: ostree-unverified-image:docker://quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21
cp0 rpm-ostree[***]: Fetching ostree-unverified-image:docker://quay.io/okd/scos-content@sha256:d5eefa3fd604488fd5d0bb0a02c781641867eda582c679b760b62d4b07c40e21
cp0 rpm-ostree[***]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:47085->[::1]:53: read: connection refused
cp0 rpm-ostree[***]: Unlocked sysroot
cp0 rpm-ostree[***]: Process [pid: 30122 uid: 0 unit: machine-config-daemon-firstboot.service] disconnected from transaction progress
cp0 rpm-ostree[***]: client(id:cli dbus:1.1629 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0
cp0 rpm-ostree[***]: In idle state; will auto-exit in 64 seconds
[3]
# Generated by KNI resolv prepender NM dispatcher script
search ocp-ipi.home.arpa
nameserver 192.168.158.29
nameserver 192.168.158.26
With quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-03-23-213604, both master and worker nodes randomly fail to join the cluster, e.g. on my first run a master node (cp2) was missing, and on a second run a master node (cp2) and a worker node (w1) were missing. This is also different behaviour from what I observed in my initial post with 4.12.0-0.okd-scos-2023-02-22-083106.
For me, it looks like quay.io/okd/scos-release:4.12.0-0.okd-scos-2023-04-14-052931 is the first image where IPI works reliably. Personally I am tempted to close this bug report, but @solacelost also had these DNS issues. @solacelost, could you please test with more recent releases and see whether those errors have been fixed for you as well?
When doing IPI on bare metal servers with OKD SCOS 4.12, the deployment fails with:
All three control plane nodes have invalid DNS settings: instead of using the DNS nameserver(s) provided by the local DHCP server, all masters have seemingly random non-local DNS nameservers listed in /etc/resolv.conf. The nameservers from network 192.168.158.0/24 listed above are the IP addresses of those nodes retrieved via DHCP, but 16.182.227.198, 128.154.207.62 and 192.23.149.53 are wrong; the local DNS nameserver provided by the DHCP server is 192.168.158.26.
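A minimal sketch of the check: flag nameservers outside the local 192.168.158.0/24 subnet. The sample below reconstructs a resolv.conf with the wrong nameservers described above; the search line and exact file layout are assumptions:

```shell
# Sample resolv.conf content with the non-local nameservers described above
# (exact layout is an assumption, not a capture from the nodes).
resolv='search ocp-ipi.home.arpa
nameserver 16.182.227.198
nameserver 128.154.207.62
nameserver 192.23.149.53'

# Print nameservers that are outside the local 192.168.158.0/24 subnet.
printf '%s\n' "$resolv" | awk '/^nameserver/ && $2 !~ /^192\.168\.158\./ {print $2}'
```

On a healthy node this prints nothing; here it prints all three bogus addresses.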
Resolving registry.ci.openshift.org fails due to those broken nameservers:
Also, systemd-sysusers.service fails on all master nodes (but that might be a red herring):
The same environment and install-config.yaml work fine with OCP RHCOS 4.12 though.

Version
Platform
registry.ci.openshift.org/origin/release-scos:scos-4.12
How reproducible
With Docker Compose and enough RAM, you can reproduce this bug using Ansible hosts lvrt-lcl-session-srv-4* from Ansible collection jm1.cloudy. This will deploy an installer-provisioned OKD cluster based on SCOS in a Docker container and use QEMU/KVM-based virtual machines to simulate bare-metal servers. The example install-config.yaml used for IPI works fine with OpenShift 4.12. This README.md has instructions on what to change to deploy OpenShift instead of OKD.
This bug is 100% reproducible.
Log bundle
log-bundle-20230222210622.tar.gz