okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Nodes always not ready when upgrading from 4.11 to 4.12 - no kubelet and ovs-configuration.service #1615

Closed garyyang85 closed 1 year ago

garyyang85 commented 1 year ago

Describe the bug

I am upgrading OKD from 4.11 to 4.12, and it is failing with the message "Cluster operators etcd, kube-apiserver are degraded". One master node and one worker node have been in NotReady status for more than 10 hours. Logging in to the NotReady nodes, I found there is no kubelet or ovs-configuration service, and some other basic binaries are also missing, such as crictl and the OVS-related files under /usr/bin. The MCP shows the machine is unavailable. There was no problem when upgrading from 4.10 to 4.11.

Version

From 4.11.0-0.okd-2022-07-29-154152 to 4.12.0-0.okd-2023-03-18-084815

How reproducible

Upgrade from OKD 4.11 to 4.12

Log bundle

vrutkovs commented 1 year ago

Please attach (or upload to a public file-sharing service) the must-gather archive.
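
For reference, collecting and packaging the archive typically looks something like the following: oc adm must-gather writes its results into a must-gather.local.* directory in the current working directory, which can then be compressed before uploading (must-gather.tgz is just an example name):

oc adm must-gather
tar czf must-gather.tgz must-gather.local.*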

garyyang85 commented 1 year ago

must-gather.tgz

vrutkovs commented 1 year ago

Kubelet on master02.taurus.eti.cdl.ibm.com is not updating its status.

Can you SSH into it? If yes, please attach the output of journalctl -b from this node. If it's unreachable, see replacing an unhealthy node.
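
A minimal way to capture that log, assuming SSH access as the core user (the hostname is the one reported above; the output file name is just an example):

ssh core@master02.taurus.eti.cdl.ibm.com 'journalctl -b --no-pager' > master02-journal.log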

garyyang85 commented 1 year ago

@vrutkovs thanks for your help. Here are the logs from the unready master and worker nodes: master-unready.log worker-unready.log

garyyang85 commented 1 year ago

Hello @vrutkovs, can you kindly help with this? Any workaround I could try would also be appreciated.

garyyang85 commented 1 year ago

Hello, could anyone give some advice on this issue? Thanks

ascherbakov686 commented 1 year ago

Hello, I have the same problem when upgrading worker nodes from 4.11.0-0.okd-2022-07-29-154152 to 4.12.0-0.okd-2023-03-18-084815. I need help or some workaround.

# ssh core@sog-prod-ocpc-ldc-176
Fedora CoreOS 37.20230218.3.0

############################################################################
WARNING: This system is using cgroups v1. For increased reliability
it is strongly recommended to migrate this system and your workloads
to use cgroups v2. For instructions on how to adjust kernel arguments
to use cgroups v2, see:
https://docs.fedoraproject.org/en-US/fedora-coreos/kernel-args/

To disable this warning, use:
sudo systemctl disable coreos-check-cgroups.service
############################################################################

Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

Last login: Tue Jul  4 15:19:31 2023 from 10.68.161.14
[systemd]
Failed Units: 2
  kubelet.service
  ovs-configuration.service

# systemctl status kubelet.service
× kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
     Active: failed (Result: exit-code) since Tue 2023-07-04 15:46:21 UTC; 5min ago
    Process: 1573 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
    Process: 1575 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
    Process: 1576 ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state (code=exited, status=0/SUCCESS)
    Process: 1579 ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --c>
   Main PID: 1579 (code=exited, status=127)
        CPU: 12ms

Jul 04 15:46:11 sog-prod-ocpc-ldc-176 kubenswrapper[1579]: /usr/local/bin/kubenswrapper: line 5: /usr/bin/kubelet: No such file or directory
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: Starting kubelet.service - Kubernetes Kubelet...
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Main process exited, code=exited, status=127/n/a
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jul 04 15:46:11 sog-prod-ocpc-ldc-176 systemd[1]: Failed to start kubelet.service - Kubernetes Kubelet.
Jul 04 15:46:21 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed to schedule restart job: Unit crio.service not found.
Jul 04 15:46:21 sog-prod-ocpc-ldc-176 systemd[1]: kubelet.service: Failed with result 'exit-code'.

After rebooting into the previous ostree deployment, it shows:

# rpm-ostree status
State: idle
Deployments:
  d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
                  Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)
                     Diff: 396 upgraded, 44 removed, 18 added

● pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)
vrutkovs commented 1 year ago

Seems it boots into the plain FCOS deployment, not the one with the FCOS+OKD binaries (so it can't find the ovs-vsctl, kubelet or crio binaries). What's the output of rpm-ostree status on this node?
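
For comparison, this is roughly what to look for in that output (the booted deployment is marked with ●):

rpm-ostree status
# healthy OKD node: the booted entry is the pivot://quay.io/openshift/okd-content@sha256:... image
#                   with "CustomOrigin: Managed by machine-config-operator"
# broken node here: the booted entry is the bare Fedora CoreOS 37.20230218.3.0 commit,
#                   which ships none of the kubelet, crio or ovs binaries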

imdmahajankanika commented 1 year ago

@garyyang85 @vrutkovs I also faced the same issue. When I SSHed into the NotReady nodes, I found two failed services (kubelet.service and ovs-configuration.service). I tried to restart the kubelet service but got a "crio not found" error. This happened to all the nodes that were upgraded to Fedora 37 during the OKD upgrade (4.11 --> 4.12).

Also, there was a warning to migrate the system to cgroups v2.
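
A quick way to confirm this state on an affected node is to check for the OKD-supplied binaries and the cgroup version directly (a sketch; the paths are the ones mentioned in this thread):

ls -l /usr/bin/kubelet /usr/bin/crio /usr/bin/crictl /usr/bin/ovs-vsctl
systemctl status kubelet.service ovs-configuration.service
stat -fc %T /sys/fs/cgroup/    # prints cgroup2fs for v2, tmpfs for v1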

garyyang85 commented 1 year ago

Hi @vrutkovs, the rpm-ostree status output on the broken master:

State: idle
Deployments:
● d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
                  Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)

  pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)

The rpm-ostree status output on the broken worker:

State: idle
Deployments:
● d34ff10be925c01aad7d088fa2dcc18aaa3e9d7ead12081a18fb6883a94385d7
                  Version: 37.20230218.3.0 (2023-03-06T20:02:24Z)

  pivot://quay.io/openshift/okd-content@sha256:a77c75b002aa480b9dc834a8c0cb38ccce347f528f1d27118c08da2bb2e199b1
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202207291018-0 (2022-07-29T10:22:03Z)

Thanks very much for the help.

vrutkovs commented 1 year ago

Yeah, your current deployment is plain FCOS. You may want to boot into the previous one and let MCD upgrade it to 4.12 again.

garyyang85 commented 1 year ago

@vrutkovs Thanks for the reply. Is there a guide on how I can "boot into a previous one"? Thanks.

vrutkovs commented 1 year ago

See https://docs.fedoraproject.org/en-US/fedora-coreos/manual-rollbacks/#_temporary_rollback
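
In short, a minimal sketch of the two options around that page, assuming the standard rpm-ostree CLI on the node:

# temporary rollback: reboot and select the previous deployment
# (the 4.11 pivot://...okd-content entry) in the GRUB boot menu
sudo systemctl reboot

# permanent rollback: make the previous deployment the default and reboot
sudo rpm-ostree rollback --reboot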

garyyang85 commented 1 year ago

Thank you. Will have a try.

garyyang85 commented 1 year ago

@vrutkovs after the rollback it ran into another issue (https://access.redhat.com/solutions/5598401); after following the workaround there, it ended up back in the state described in this issue again :(