okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
Apache License 2.0

Agent Installer installation "loses" the DNS config at some point and needs a manual reboot of the rendezvous host #1906

Open titou10titou10 opened 3 months ago

titou10titou10 commented 3 months ago

OKD version: 4.15.0-0.okd-2024-03-10-010116


I tried to install OKD on bare metal with the agent installer as described here. Overall I succeeded, but I encountered two problems:


Part of the agent-config.yaml:

additionalNTPSources:
  - 0.pool.ntp.org
  - 1.pool.ntp.org
hosts:
  - hostname: okd5-master1
    role: master
    interfaces:
      - name: ens18
        macAddress: aa:bb:cc:dd:ee:63
    rootDeviceHints:
      deviceName: /dev/sda
    networkConfig:
      interfaces:
        - name: ens18
          type: ethernet
          state: up
          mac-address: aa:bb:cc:dd:ee:63
          ipv4:
            enabled: true
            dhcp: true
            auto-dns: false
            auto-gateway: true
            auto-routes: true
          ipv6:
            enabled: false
      dns-resolver:
        config:
          search:
            - denis.prive

The 4 other nodes follow the same pattern.


First problem

After creating the ISO image etc., all 5 nodes are started at the same time and the installation starts. The progress is monitored with:

./openshift-install --dir install agent wait-for install-complete

INFO Host okd5-master2: updated status from insufficient to known (Host is ready to be installed)
INFO Cluster is ready for install
INFO Cluster validation: All hosts in the cluster are ready to install.
INFO Preparing cluster for installation
INFO Host okd5-master2: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
INFO Host okd5-master3 validation: Host NTP is synced
INFO Host okd5-master2 validation: Host NTP is synced
INFO Host okd5-worker2 validation: Host NTP is synced
INFO Host okd5-worker2: validation 'ntp-synced' is now fixed
INFO Host okd5-worker1 validation: Host NTP is synced
INFO Host okd5-master1 validation: Host NTP is synced
INFO Host okd5-worker1: validation 'ntp-synced' is now fixed
INFO Host okd5-master1: New image status quay.io/openshift/okd-content@sha256:786a746a4cdce34c925e0cf10082a2b9caa27edd9c0bc037272cd8a85f79f922. result: success. time: 4.04 seconds; size: 509.25 Megabytes; download rate: 132.32 MBps
INFO Host okd5-worker1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host: okd5-master1, reached installation stage Writing image to disk
INFO Host: okd5-master2, reached installation stage Rebooting
INFO Host: okd5-master1, reached installation stage Waiting for control plane: Waiting for bootstrap node preparation
INFO Host: okd5-master1, reached installation stage Waiting for control plane: Waiting for masters to join bootstrap control plane

Then everything stops. The console of okd5-master1 shows that something is looping:

[Screenshot of the okd5-master1 console]

I then sshed to the node:

  [root@okd5-master1 ~]# podman ps -a
  CONTAINER ID  IMAGE                                                                                                  COMMAND               CREATED        STATUS                    PORTS       NAMES
  a86556f2908e  localhost/podman-pause:4.7.0-1695838680                                                                                      8 minutes ago  Up 7 minutes                          11e0716db4f5-infra
  6eda9b76734b  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /bin/bash start_d...  7 minutes ago  Up 7 minutes                          assisted-db
  e33f5947e76e  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /assisted-service     7 minutes ago  Up 7 minutes                          service
  ebf19a760d6d  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /usr/local/bin/ag...  7 minutes ago  Exited (0) 7 minutes ago              apply-host-config
  85c950aa98b6  quay.io/openshift/okd-content@sha256:57109646c2e66aee05c7003d0e0b7f1538f37a01c2f633fad8e962b3e1727335  next_step_runner ...  7 minutes ago  Up 7 minutes                          next-step-runner
  7d601e5cca4f  quay.io/openshift/okd-content@sha256:786a746a4cdce34c925e0cf10082a2b9caa27edd9c0bc037272cd8a85f79f922  --role bootstrap ...  4 minutes ago  Up 4 minutes                          assisted-installer
  d3e4d0f0bb0c  quay.io/openshift/okd-content@sha256:b4aa05ed09915158bbf554dff010f1a5adde269a8c9a207fae85a8739b627583  start --node-name...  4 minutes ago  Exited (0) 3 minutes ago              suspicious_chandrasekhar

  [root@okd5-master1 ~]# journalctl -xn -u crio | less

  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.388636025Z" level=info msg="Registered SIGHUP reload watcher"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.389892926Z" level=info msg="Starting seccomp notifier watcher"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.390031988Z" level=info msg="Create NRI interface"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.390052759Z" level=info msg="NRI interface is disabled in the configuration."
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.391515863Z" level=info msg="Serving metrics on :9537 via HTTP"
  Mar 21 01:49:39 okd5-master1 systemd[1]: Started crio.service - Container Runtime Interface for OCI (CRI-O).
  ░░ Subject: A start job for unit crio.service has finished successfully
  ░░ Defined-By: systemd
  ░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
  ░░ A start job for unit crio.service has finished successfully.
  ░░ The job identifier is 1618.
  Mar 21 01:49:41 okd5-master1 crio[6495]: time="2024-03-21 01:49:41.209741849Z" level=info msg="Checking image status: quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2" id=4f6e2aaa-4c1b-4252-81d2-851c74658612 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:49:41 okd5-master1 crio[6495]: time="2024-03-21 01:49:41.210172943Z" level=info msg="Image quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2 not found" id=4f6e2aaa-4c1b-4252-81d2-851c74658612 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:54:41 okd5-master1 crio[6495]: time="2024-03-21 01:54:41.327860027Z" level=info msg="Checking image status: quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2" id=e5855e4d-94d7-4e45-b4e2-aa9bc6ba86d4 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:54:41 okd5-master1 crio[6495]: time="2024-03-21 01:54:41.328261482Z" level=info msg="Image quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2 not found" id=e5855e4d-94d7-4e45-b4e2-aa9bc6ba86d4 name=/runtime.v1.ImageService/ImageStatus

  [root@okd5-master1 ~]# ping quay.io
  ping: quay.io: Temporary failure in name resolution

  [root@okd5-master1 ~]# more /etc/resolv.conf
  [root@okd5-master1 ~]#

So the bootstrap (BS) node was not able to continue: it could not download images from quay.io because /etc/resolv.conf is empty at this stage! ("Image quay.io/openshift/okd-content@sha256:... not found")
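The diagnosis above can be repeated with a small check. This is only a sketch: `has_nameserver` and `can_resolve` are hypothetical helper names, not part of OKD or the agent installer, and it assumes `grep` and `getent` are available on the node.

```shell
# Hypothetical helpers to confirm the symptom described above.

# Succeeds if the given resolv.conf-style file has at least one
# "nameserver <address>" line. Defaults to /etc/resolv.conf.
has_nameserver() {
    grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]+' \
        "${1:-/etc/resolv.conf}" 2>/dev/null
}

# Succeeds if the given host name resolves through the system resolver.
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}
```

On the affected node, `has_nameserver || echo "no nameserver configured"` followed by `can_resolve quay.io || echo "quay.io does not resolve"` would be expected to print both messages while /etc/resolv.conf is empty.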

I added the lines from agent-config.yaml to /etc/resolv.conf, and the installation immediately stopped looping and went on...

    search denis.prive

The installation of the 4 other nodes then continued and succeeded.
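For reference, the manual workaround could be scripted roughly as follows. This is a sketch under assumptions: `write_resolv_conf` is a hypothetical helper, and the nameserver address in the usage note is a documentation placeholder (192.0.2.53), since the report does not show the real DNS server entries.

```shell
# Hypothetical helper: write a search domain and nameservers to a
# resolv.conf-style file, overwriting its previous content.
#   $1: target file
#   $2: search domain
#   remaining arguments: nameserver addresses
write_resolv_conf() {
    target=$1
    search=$2
    shift 2
    {
        printf 'search %s\n' "$search"
        for ns in "$@"; do
            printf 'nameserver %s\n' "$ns"
        done
    } > "$target"
}
```

Run as root on the rendezvous host, a call such as `write_resolv_conf /etc/resolv.conf denis.prive 192.0.2.53` would recreate the file, with the placeholder replaced by the DNS server from agent-config.yaml.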

Second problem

Then the installation stopped again and never finished. After waiting a long time (with all nodes at about 5% CPU), I managed to open an oc session to okd-master1.

oc get nodes listed all the nodes as "ready" except the BS node (okd5-master1), which was not even in the list. And of course oc get co and oc get clusterversion indicated that many operators were broken because 1 of the 3 masters was missing...

[root@kutils okd5]# oc get nodes -o wide
NAME           STATUS   ROLES                  AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                        KERNEL-VERSION          CONTAINER-RUNTIME
okd5-master2   Ready    control-plane,master   29m   v1.28.7+6e2789b   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-master3   Ready    control-plane,master   29m   v1.28.7+6e2789b   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-worker1   Ready    worker                 15m   v1.28.7+6e2789b   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-worker2   Ready    worker                 15m   v1.28.7+6e2789b   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
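A missing node like this can be spotted without reading the table by eye. The sketch below assumes the expected host names are passed as arguments; `missing_nodes` is a hypothetical helper that reads the output of `oc get nodes -o name` on standard input.

```shell
# Hypothetical helper: print every expected host name that is absent
# from the node list read on stdin.
#   stdin: output of `oc get nodes -o name` (lines like node/okd5-master2)
#   arguments: expected host names
missing_nodes() {
    # Strip the "node/" prefix so that only bare names remain.
    present=$(sed 's|^node/||')
    for expected in "$@"; do
        printf '%s\n' "$present" | grep -qxF -- "$expected" \
            || printf '%s\n' "$expected"
    done
}
```

For the cluster in this report, the call would be `oc get nodes -o name | missing_nodes okd5-master1 okd5-master2 okd5-master3 okd5-worker1 okd5-worker2`.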

At this point the status is this:

INFO Bootstrap Kube API Initialized
INFO Bootstrap configMap status is complete
INFO cluster bootstrap is complete

So I SSHed into okd5-master1 again and forced a reboot with shutdown -r now, and ta-da... the installation of the BS node finished, and the cluster installation finally ran to the end with all 5 nodes known to the cluster and "ready".

titou10titou10 commented 3 months ago

"must-gather" taken directly from okd5-master1 while the installation loops, before editing the empty /etc/resolv.conf file:

ssh core@okd5-master1 sudo /usr/local/bin/agent-gather -O > okd5-master1_agent-gather.tar.gz


"Must-gather" before rebooting, when all nodes are present except the BS node:

export KUBECONFIG=...
oc login ...
oc adm must-gather 
