okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

machine-config-daemon pod: error reading osImageURL from rpm-ostree #1865

Closed ureiihtur closed 9 months ago

ureiihtur commented 9 months ago

Describe the bug

The oc command reports the machine-config cluster operator as unavailable and degraded:

oc get clusteroperators.config.openshift.io
NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-0.okd-2024-01-06-084517   True        False         False      55m     
baremetal                                  4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
cloud-controller-manager                   4.14.0-0.okd-2024-01-06-084517   True        False         False      108m    
cloud-credential                           4.14.0-0.okd-2024-01-06-084517   True        False         False      119m    
cluster-autoscaler                         4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
config-operator                            4.14.0-0.okd-2024-01-06-084517   True        False         False      93m     
console                                    4.14.0-0.okd-2024-01-06-084517   True        False         False      79m     
control-plane-machine-set                  4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
csi-snapshot-controller                    4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
dns                                        4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
etcd                                       4.14.0-0.okd-2024-01-06-084517   True        False         False      89m     
image-registry                             4.14.0-0.okd-2024-01-06-084517   True        False         False      83m     
ingress                                    4.14.0-0.okd-2024-01-06-084517   True        False         False      83m     
insights                                   4.14.0-0.okd-2024-01-06-084517   True        False         False      86m     
kube-apiserver                             4.14.0-0.okd-2024-01-06-084517   True        False         False      86m     
kube-controller-manager                    4.14.0-0.okd-2024-01-06-084517   True        False         False      86m     
kube-scheduler                             4.14.0-0.okd-2024-01-06-084517   True        False         False      87m     
kube-storage-version-migrator              4.14.0-0.okd-2024-01-06-084517   True        False         False      93m     
machine-api                                4.14.0-0.okd-2024-01-06-084517   True        False         False      87m     
machine-approver                           4.14.0-0.okd-2024-01-06-084517   True        False         False      91m     
machine-config                                                              False       True          True       80m     Cluster not available for []: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 3, updated: 3, ready: 0, unavailable: 3)]
marketplace                                4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
monitoring                                 4.14.0-0.okd-2024-01-06-084517   True        False         False      79m     
network                                    4.14.0-0.okd-2024-01-06-084517   True        False         False      94m     
node-tuning                                4.14.0-0.okd-2024-01-06-084517   True        False         False      88m     
openshift-apiserver                        4.14.0-0.okd-2024-01-06-084517   True        False         False      79m     
openshift-controller-manager               4.14.0-0.okd-2024-01-06-084517   True        False         False      90m     
openshift-samples                          4.14.0-0.okd-2024-01-06-084517   True        False         False      84m     
operator-lifecycle-manager                 4.14.0-0.okd-2024-01-06-084517   True        False         False      91m     
operator-lifecycle-manager-catalog         4.14.0-0.okd-2024-01-06-084517   True        False         False      91m     
operator-lifecycle-manager-packageserver   4.14.0-0.okd-2024-01-06-084517   True        False         False      85m     
service-ca                                 4.14.0-0.okd-2024-01-06-084517   True        False         False      93m     
storage                                    4.14.0-0.okd-2024-01-06-084517   True        False         False      93m
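
With this many operators, the unhealthy row is easy to miss. As a rough sketch, the table can be filtered down to operators that are not Available or are Degraded. The function name unhealthy_operators is made up for illustration, and the filter assumes the default table layout shown above:

```shell
# Hypothetical helper: reads `oc get clusteroperators` table output on stdin
# and prints the names of operators that are not Available or are Degraded.
unhealthy_operators() {
  awk 'NR > 1 {
    # VERSION can be blank (as for machine-config above), so locate the
    # first True/False/Unknown column instead of using fixed positions.
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^(True|False|Unknown)$/) {
        # $i is AVAILABLE, $(i + 2) is DEGRADED.
        if ($i != "True" || $(i + 2) != "False") print $1
        break
      }
    }
  }'
}

# Usage on a live cluster:
#   oc get clusteroperators | unhealthy_operators
```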

Looking further, I found an osImageURL error in the machine-config-daemon container.

oc get all -n openshift-machine-config-operator 

Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                                             READY   STATUS             RESTARTS         AGE
pod/machine-config-controller-695c4fbffd-mvvcd   2/2     Running            0                95m
pod/machine-config-daemon-gfq67                  1/2     CrashLoopBackOff   12 (2m36s ago)   38m
pod/machine-config-daemon-gp2j6                  1/2     CrashLoopBackOff   23 (2m6s ago)    95m
pod/machine-config-daemon-p462v                  1/2     CrashLoopBackOff   11 (110s ago)    32m
pod/machine-config-operator-98b8866dd-z4mtd      2/2     Running            0                112m
pod/machine-config-server-b95s2                  1/1     Running            0                94m
pod/machine-config-server-f24s4                  1/1     Running            0                60m
pod/machine-config-server-wgb2p                  1/1     Running            0                94m

NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/machine-config-controller   ClusterIP   172.30.251.89    <none>        9001/TCP   123m
service/machine-config-daemon       ClusterIP   172.30.151.34    <none>        9001/TCP   123m
service/machine-config-operator     ClusterIP   172.30.183.214   <none>        9001/TCP   123m

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/machine-config-daemon   3         3         0       3            0           kubernetes.io/os=linux            95m
daemonset.apps/machine-config-server   3         3         3       3            3           node-role.kubernetes.io/master=   94m

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/machine-config-controller   1/1     1            1           95m
deployment.apps/machine-config-operator     1/1     1            1           123m

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/machine-config-controller-695c4fbffd   1         1         1       95m
replicaset.apps/machine-config-operator-98b8866dd      1         1         1       123m

oc describe -n openshift-machine-config-operator pod machine-config-daemon-gfq67

Name:                 machine-config-daemon-gfq67
Namespace:            openshift-machine-config-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      machine-config-daemon
Node:                 master0.test.example.com/192.168.101.155
Start Time:           Tue, 23 Jan 2024 16:56:24 +0800
Labels:               controller-revision-hash=6cd8689cf8
                      k8s-app=machine-config-daemon
                      pod-template-generation=1
Annotations:          openshift.io/scc: privileged
Status:               Running
IP:                   192.168.101.155
IPs:
  IP:           192.168.101.155
Controlled By:  DaemonSet/machine-config-daemon
Containers:
  machine-config-daemon:
    Container ID:  cri-o://991d908f4eaeb275f6b7efa98f6f5149c3047348dbc656cfb661ae719facfc22
    Image:         quay.io/openshift/okd-content@sha256:7df1a8d75db145a9f761e1de429d209dc73b21291d791082fa9fbb37231f0dcf
    Image ID:      quay.io/openshift/okd-content@sha256:7df1a8d75db145a9f761e1de429d209dc73b21291d791082fa9fbb37231f0dcf
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/machine-config-daemon
    Args:
      start
      --payload-version=4.14.0-0.okd-2024-01-06-084517
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0123 09:37:56.610828   45601 start.go:61] Version: machine-config-daemon-4.6.0-202006240615.p0-2357-g7649b927-dirty (7649b9274cde2fb50a61a579e3891c8ead2d79c5)
I0123 09:37:56.610868   45601 update.go:1962] Running: mount --rbind /run/secrets /rootfs/run/secrets
I0123 09:37:56.611919   45601 update.go:1962] Running: mount --rbind /usr/bin /rootfs/run/machine-config-daemon-bin
I0123 09:37:56.612845   45601 daemon.go:475] assuming we can use container binary chroot() to host
I0123 09:37:56.631099   45601 daemon.go:525] Invoking re-exec /run/bin/machine-config-daemon
I0123 09:37:56.647640   45601 start.go:61] Version: machine-config-daemon-4.6.0-202006240615.p0-2357-g7649b927-dirty (7649b9274cde2fb50a61a579e3891c8ead2d79c5)
E0123 09:37:56.647722   45601 rpm-ostree.go:284] Merged secret file could not be validated; defaulting to cluster pull secret <nil>
I0123 09:37:56.647736   45601 rpm-ostree.go:262] Linking ostree authfile to /var/lib/kubelet/config.json
F0123 09:37:56.729105   45601 start.go:96] Failed to initialize single run daemon: error reading osImageURL from rpm-ostree: exit status 1

      Exit Code:    255
      Started:      Tue, 23 Jan 2024 17:37:56 +0800
      Finished:     Tue, 23 Jan 2024 17:37:56 +0800
    Ready:          False
    Restart Count:  13
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /rootfs from rootfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zgvmv (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://9396339c1caf4b04554e942ad5b5c4444457ad4b7cf6c98015a3a99a3f50ab7a
    Image:         quay.io/openshift/okd-content@sha256:f7567bea7f24184da06c3ed87b396c804342fcfd89f54989622780ce0b7b0724
    Image ID:      quay.io/openshift/okd-content@sha256:d0dc8a4f0230f671d81c5f7e089636fc5cc8c21a5a5d5f0e17c463ffe5f77376
    Port:          9001/TCP
    Host Port:     9001/TCP
    Args:
      --secure-listen-address=0.0.0.0:9001
      --config-file=/etc/kube-rbac-proxy/config-file.yaml
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      --upstream=http://127.0.0.1:8797
      --logtostderr=true
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Running
      Started:      Tue, 23 Jan 2024 16:56:25 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        20m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from mcd-auth-proxy-config (rw)
      /etc/tls/private from proxy-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zgvmv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  rootfs:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  proxy-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  proxy-tls
    Optional:    false
  mcd-auth-proxy-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-rbac-proxy
    Optional:  false
  kube-api-access-zgvmv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  44m                    default-scheduler  Successfully assigned openshift-machine-config-operator/machine-config-daemon-gfq67 to master0.test.example.com
  Normal   Pulled     44m                    kubelet            Container image "quay.io/openshift/okd-content@sha256:f7567bea7f24184da06c3ed87b396c804342fcfd89f54989622780ce0b7b0724" already present on machine
  Normal   Created    44m                    kubelet            Created container kube-rbac-proxy
  Normal   Started    44m                    kubelet            Started container kube-rbac-proxy
  Normal   Pulled     42m (x5 over 44m)      kubelet            Container image "quay.io/openshift/okd-content@sha256:7df1a8d75db145a9f761e1de429d209dc73b21291d791082fa9fbb37231f0dcf" already present on machine
  Normal   Created    42m (x5 over 44m)      kubelet            Created container machine-config-daemon
  Normal   Started    42m (x5 over 44m)      kubelet            Started container machine-config-daemon
  Warning  BackOff    4m13s (x185 over 44m)  kubelet            Back-off restarting failed container machine-config-daemon in pod machine-config-daemon-gfq67_openshift-machine-config-operator(810ef769-ed6c-491e-8b14-3a0078e3c93a)
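
The fatal line is buried in the Last State message above. A minimal sketch for pulling only the error and fatal lines out of the previous container run follows. It assumes the standard klog prefix seen in the log (a severity letter I, W, E or F followed by month and day), and klog_errors is a hypothetical helper name:

```shell
# Hypothetical helper: keep only klog lines with severity E (error) or
# F (fatal), e.g. "F0123 09:37:56.729105 ...".
klog_errors() {
  grep -E '^[EF][0-9]{4} '
}

# Usage on a live cluster; --previous reads the log of the last crashed run:
#   oc logs -n openshift-machine-config-operator machine-config-daemon-gfq67 \
#     -c machine-config-daemon --previous | klog_errors
```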

Version

4.14.0-0.okd-2024-01-06-084517 UPI

How reproducible

100% of the time

Log bundle Log bundle

vrutkovs commented 9 months ago

Do you have ssh access on the nodes? Check if rpm-ostreed.service is running and rpm-ostree status doesn't have pending deployments
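
A sketch of that check, meant to be run on the node itself over SSH. The function name check_rpm_ostree is made up for illustration. It also assumes rpm-ostreed is started on demand, so an inactive state alone need not be a failure, while a failing rpm-ostree status is:

```shell
# Hypothetical helper: report the rpm-ostreed unit state and the
# deployment list on this node.
check_rpm_ostree() {
  # is-active exits non-zero for inactive/failed units, so ignore the
  # exit code and keep only the printed state.
  state=$(systemctl is-active rpm-ostreed.service) || true
  echo "rpm-ostreed.service: ${state}"

  # rpm-ostree status talks to the daemon; if the daemon cannot start,
  # this is where the error shows up.
  if ! status=$(rpm-ostree status 2>&1); then
    echo "rpm-ostree status failed:"
    echo "${status}"
    return 1
  fi

  # The booted deployment is marked with a bullet; any deployment listed
  # above it is pending and takes effect on the next reboot.
  echo "${status}"
}

# Usage on a node:
#   check_rpm_ostree
```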

ureiihtur commented 9 months ago

> Do you have ssh access on the nodes? Check if rpm-ostreed.service is running and rpm-ostree status doesn't have pending deployments

rpm-ostreed.service fails to start with a dependency error on all of the cluster nodes:

rpm-ostree status

A dependency job for rpm-ostreed.service failed. See 'journalctl -xe' for details.
○ rpm-ostreed.service - rpm-ostree System Management Daemon
     Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static)
    Drop-In: /etc/systemd/system/rpm-ostreed.service.d
             └─10-mco-default-env.conf
             /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
             /run/systemd/system/rpm-ostreed.service.d
             └─bug2111817.conf
             /etc/systemd/system/rpm-ostreed.service.d
             └─mco-controlplane-nice.conf
     Active: inactive (dead)
       Docs: man:rpm-ostree(1)

Jan 23 09:43:34 master2.test.yaude.com systemd[1]: Dependency failed for rpm-ostreed.service - rpm-ostree System Management Daemon.
Jan 23 09:43:34 master2.test.yaude.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
Jan 23 09:48:45 master2.test.yaude.com systemd[1]: Dependency failed for rpm-ostreed.service - rpm-ostree System Management Daemon.
Jan 23 09:48:45 master2.test.yaude.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
Jan 23 09:53:48 master2.test.yaude.com systemd[1]: Dependency failed for rpm-ostreed.service - rpm-ostree System Management Daemon.
Jan 23 09:53:48 master2.test.yaude.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
Jan 23 09:59:00 master2.test.yaude.com systemd[1]: Dependency failed for rpm-ostreed.service - rpm-ostree System Management Daemon.
Jan 23 09:59:00 master2.test.yaude.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
Jan 23 10:01:30 master2.test.yaude.com systemd[1]: Dependency failed for rpm-ostreed.service - rpm-ostree System Management Daemon.
Jan 23 10:01:30 master2.test.yaude.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
error: Loading sysroot: exit status: 1

vrutkovs commented 9 months ago

That's odd, its only dependency is mounted /boot. Check that all mounts have succeeded?
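
One way to trace which unit is actually failing, as a sketch: pull the unit names out of systemd's "Dependency failed for ..." journal messages and then inspect each one. The helper name failed_dependencies is hypothetical, and the pattern assumes the message format shown in the journal excerpt earlier in this thread:

```shell
# Hypothetical helper: reads journal text on stdin and prints each unit
# that systemd reported as "Dependency failed for <unit> - <description>".
failed_dependencies() {
  sed -n 's/.*Dependency failed for \([^ ]*\) - .*/\1/p' | sort -u
}

# Usage on a node:
#   journalctl -b | failed_dependencies
# Then, for a unit it prints (for example boot.mount):
#   systemctl list-dependencies boot.mount
#   systemctl status boot.mount
#   findmnt /boot
```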

ureiihtur commented 9 months ago

> That's odd, its only dependency is mounted /boot. Check that all mounts have succeeded?

It looks like the /boot mount failed. I found these messages from systemd:

Jan 23 07:27:40 localhost systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.
Jan 23 07:27:40 localhost systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.
Jan 23 09:03:05 bootstrap.test.yaude.com systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.
Jan 23 09:03:54 bootstrap.test.yaude.com systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.
Jan 23 10:00:59 bootstrap.test.yaude.com systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.
Jan 23 10:13:30 bootstrap.test.yaude.com systemd[1]: Dependency failed for boot.mount - CoreOS Dynamic Mount for /boot.

There is also always this message at SSH login. Are the two related?

Fedora CoreOS 38.20231027.3.2

############################################################################
WARNING: This system has layered modularity RPMs. In Fedora 39 modularity
has been retired. The system will most likely stop updating successfully
when Fedora CoreOS transitions to Fedora 39. See this tracker for more info:
https://github.com/coreos/fedora-coreos-tracker/issues/1513

To disable this warning, use:
sudo systemctl disable coreos-check-modularity.service
############################################################################

lizhaoyangre commented 4 months ago

I have also encountered this problem. How did you solve it?

ureiihtur commented 4 months ago

I replaced Fedora 39 with 38, and the problem was solved.

lizhaoyangre commented 3 months ago

> I replaced Fedora 38 with Fedora 39, and the problem was solved.

I solved the problem by following your method. Thank you very much for your help.