[Closed] garyyang85 closed this issue 1 year ago
Please attach (or upload to a public file sharing service) the must-gather archive.
Kubelet on master02.taurus.eti.cdl.ibm.com is not updating its status.
Can you ssh into it? If yes, please attach the output of `journalctl -b` from this node. If it's unreachable, see "replacing unhealthy node".
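For reference, a typical way to collect that boot journal over ssh (the hostname is taken from this thread; substitute your own node):

```shell
# Capture the current boot's journal from the unhealthy node into a local file.
# "core" is the default user on Fedora CoreOS nodes.
ssh core@master02.taurus.eti.cdl.ibm.com 'sudo journalctl -b --no-pager' > master-unready.log
```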
@vrutkovs thanks for your help. Here are the logs from unready master and worker node master-unready.log worker-unready.log
Hello @vrutkovs, can you kindly help with this? Any workaround I could apply would also be appreciated.
Hello, could anyone give some advice on this issue? Thanks
Hello, I hit the same problem when upgrading worker nodes from 4.11.0-0.okd-2022-07-29-154152 to 4.12.0-0.okd-2023-03-18-084815. I need help or some workaround.
# ssh core@sog-prod-ocpc-ldc-176
Fedora CoreOS 37.20230218.3.0
############################################################################
WARNING: This system is using cgroups v1. For increased reliability
it is strongly recommended to migrate this system and your workloads
to use cgroups v2. For instructions on how to adjust kernel arguments
to use cgroups v2, see:
https://docs.fedoraproject.org/en-US/fedora-coreos/kernel-args/
To disable this warning, use:
sudo systemctl disable coreos-check-cgroups.service
############################################################################
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos
Last login: Tue Jul 4 15:19:31 2023 from 10.68.161.14
[systemd]
Failed Units: 2
kubelet.service
ovs-configuration.service
# systemctl status kubelet.service
× kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
Active: failed (Result: exit-code) since Tue 2023-07-04 15:46:21 UTC; 5min ago
Process: 1573 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
Process: 1575 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
Process: 1576 ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state (code=exited, status=0/SUCCESS)
Process: 1579 ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --c>
Main PID: 1579 (code=exited, status=127)
CPU: 12ms
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 kubenswrapper[1579]: /usr/local/bin/kubenswrapper: line 5: /usr/bin/kubelet: No such file or directory
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: Starting kubelet.service - Kubernetes Kubelet...
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Main process exited, code=exited, status=127/n/a
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.
Jul 04 15:46:21 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed to schedule restart job: Unit crio.service not found.
Jul 04 15:46:21 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed with result 'exit-code'.
When rebooted into the previous ostree deployment, it shows:
# rpm-ostree status
State: idle
Deployments:
d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)
Diff: 396 upgraded, 44 removed, 18 added
● pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
CustomOrigin: Managed by machine-config-operator
Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)
Seems it boots into the FCOS deployment, not FCOS+OKD binaries (so it can't find the `ovs-vsctl`, `kubelet`, or `crio` binaries). What's the output of `rpm-ostree status` on this node?
@garyyang85 @vrutkovs I also faced the same issue. When I ssh'd into the nodes with NOT Ready status, I found two failed services (kubelet.service and ovs-configuration.service). I tried to restart the kubelet service but got an error "crio not found". This happened to all the nodes that were upgraded to Fedora 37 during the OKD upgrade (4.11 → 4.12).
Also, there was a warning to do migration to cgroup v2.
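(Regarding that warning: the Fedora CoreOS documentation linked in the login banner describes the migration as a kernel-argument change. A minimal sketch, assuming the node still carries the legacy karg that pins it to cgroups v1; verify the karg name against the linked docs before running this:)

```shell
# Show the current kernel arguments on the node
rpm-ostree kargs

# Drop the karg that keeps the node on cgroups v1, then reboot to apply
# (karg name per the Fedora CoreOS kernel-args documentation).
sudo rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=0
sudo systemctl reboot
```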
Hi @vrutkovs, the rpm-ostree status response on the broken master:
State: idle
Deployments:
● d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)
pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
CustomOrigin: Managed by machine-config-operator
Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)
The rpm-ostree status response on the broken worker:
State: idle
Deployments:
● d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)
pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
CustomOrigin: Managed by machine-config-operator
Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)
Thanks very much for the help.
Yeah, your current deployment is plain FCOS. You may want to boot into a previous one and let MCD upgrade it to 4.12 again.
@vrutkovs Thanks for the reply. Is there a guide on how I can "boot into a previous one"? Thanks.
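(For reference, there are two common ways to boot into a previous ostree deployment: select the older entry in the GRUB menu during boot, or pin it from the running node with rpm-ostree. A sketch of the latter:)

```shell
# List deployments; the ● marker shows the one currently booted
rpm-ostree status

# Make the previous deployment the default and reboot into it
sudo rpm-ostree rollback --reboot
```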
Thank you. Will have a try.
@vrutkovs after the rollback, it runs into another issue (https://access.redhat.com/solutions/5598401); after following that workaround, it runs into the status in this issue again :(
Describe the bug
I am upgrading OKD from 4.11 to 4.12, and it is failing with the notice "Cluster operators etcd, kube-apiserver are degraded". I found one master node and one worker node in Not Ready status for more than 10 hours. Logging in to the not-ready nodes, I found there is no kubelet or ovs-configuration service; some other basic binaries were also lost, such as crictl and the ovs-related files under /usr/bin. The MCP showed the machine is unavailable. There was no problem when upgrading from 4.10 to 4.11.
Version
From 4.11.0-0.okd-2022-07-29-154152 to 4.12.0-0.okd-2023-03-18-084815
How reproducible
Upgrade from OKD 4.11 to 4.12
Log bundle