openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

image-registry Available: The deployment does not exist.. | Unable to apply 4.16.15: the cluster operator image-registry is not available #29212

Open n00bsi opened 1 month ago

n00bsi commented 1 month ago

Version

$ oc version
Client Version: 4.15.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.14
Kubernetes Version: v1.29.8+f10c92d

Steps To Reproduce

  1. Start the update from 4.16.14 to 4.16.15.
  2. The update hangs at 88%.

Current Result

The update hangs at 88%.

Available: The deployment does not exist
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created


$ oc describe pod -n openshift-image-registry node-ca-5c6gg | grep Node
Node-Selectors:              kubernetes.io/os=linux
  Warning  NodeNotReady  100m (x3 over 5h21m)  node-controller  Node is not ready

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.14   True        True          6d      Unable to apply 4.16.15: the cluster operator image-registry is not available

$ oc get clusteroperator image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry             False       True          True       26m     Available: The deployment does not exist...

$ oc get pvc ...
...
ocs4registry   Bound   pvc-38960e2f-4c6b-450d-a5fe-c1a26714e496   1Gi   RWX   longhorn   162 ...

How can I fix this?

$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-7c87776c4c-csz22   1/1     Running   0          38m
node-ca-5c6gg                                      1/1     Running   0          38m
node-ca-c492l                                      1/1     Running   0          38m
node-ca-crzlc                                      1/1     Running   0          156m
node-ca-dskf6                                      1/1     Running   0          38m
node-ca-mpwjb                                      1/1     Running   0          38m
node-ca-xmjbp                                      1/1     Running   0          38m



Output of: `oc edit configs.imageregistry.operator.openshift.io -o yaml`

See the attached file:
[image_reg.yaml.log](https://github.com/user-attachments/files/17459954/image_reg.yaml.log)

##### Expected Result

The update runs to completion.

n00bsi commented 1 month ago

Found some things:

oc describe clusteroperator/machine-config

oc delete pod node-ca-8566c -n openshift-image-registry
(and likewise for all other node-ca-* pods)
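
For anyone repeating this step, deleting every node-ca pod by hand gets tedious. Below is a minimal sketch, assuming the pods are all named `node-ca-*` as in the pod listing earlier in this issue. `select_node_ca_pods` is a hypothetical helper name, not part of `oc`. Since node-ca is a daemon set, the deleted pods should be recreated automatically.

```shell
#!/bin/sh
# Sketch: pick the node-ca pods out of `oc get pods -o name` output.
# Reads pod names on stdin, one per line, e.g. "pod/node-ca-5c6gg".
select_node_ca_pods() {
  # grep exits non-zero when nothing matches; `|| true` keeps
  # scripts that use `set -e` from aborting on an empty result.
  grep '^pod/node-ca-' || true
}

# Hypothetical usage against a live cluster (not executed here):
#   oc get pods -n openshift-image-registry -o name \
#     | select_node_ca_pods \
#     | xargs -r oc delete -n openshift-image-registry
```

The `-r` flag of `xargs` (skip the command when the input is empty) is a GNU extension, so drop it if your `xargs` does not accept it.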

oc get pods -n openshift-machine-config-operator 
oc logs -f -n openshift-machine-config-operator machine-config-controller-645db999c6-xjsqs -c machine-config-controller

oc adm drain node1.domain.tld --ignore-daemonsets --force --delete-emptydir-data

https://www.neteye-blog.com/2023/08/debug-and-workarounds-for-a-stuck-update-on-openshift-4-13-6/

https://access.redhat.com/solutions/5317441

https://access.redhat.com/solutions/5598401

Now all nodes are at the same OS level:

Red Hat Enterprise Linux CoreOS 416.94.202409191851-0

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.15   True        False         112m    Cluster version is 4.16.15
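
If you want to check from a script that the update has really finished, something like the following could work. It is only a sketch: `update_finished` is a made-up helper that reads the table printed by `oc get clusterversion` and assumes the column order shown in this issue (AVAILABLE third, PROGRESSING fourth).

```shell
#!/bin/sh
# Sketch: succeed only when the "version" row reports
# AVAILABLE=True and PROGRESSING=False.
# Reads the plain table output of `oc get clusterversion` on stdin.
update_finished() {
  awk '
    $1 == "version" { found = 1; ok = ($3 == "True" && $4 == "False") }
    END { if (found && ok) exit 0; exit 1 }
  '
}

# Hypothetical usage against a live cluster (not executed here):
#   oc get clusterversion | update_finished && echo "update complete"
```

With the output from the original report (PROGRESSING still True) the function returns a non-zero status, while the final output above makes it succeed.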