okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Updating to "4.15.0-0.okd-2024-03-10-010116" from "4.14.0-0.okd-2024-01-26-175629" #2014

Closed. ObieBent closed this issue 3 months ago

ObieBent commented 3 months ago

Describe the bug

Unable to apply 4.15.0-0.okd-2024-03-10-010116: wait has exceeded 40 minutes for these operators: machine-config

Version

UPI install method 4.14.0-0.okd-2024-01-26-175629

How reproducible

The upgrade has been stuck for 2 days. All cluster operators have been updated except machine-config (of course).

NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.okd-2024-03-10-010116   True        False         False      11m
baremetal                                  4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
cloud-controller-manager                   4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
cloud-credential                           4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
cluster-autoscaler                         4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
config-operator                            4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
console                                    4.15.0-0.okd-2024-03-10-010116   True        False         False      37m
control-plane-machine-set                  4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
csi-snapshot-controller                    4.15.0-0.okd-2024-03-10-010116   True        False         False      40d
dns                                        4.15.0-0.okd-2024-03-10-010116   True        False         False      40d
etcd                                       4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
image-registry                             4.15.0-0.okd-2024-03-10-010116   True        False         False      2d11h
ingress                                    4.15.0-0.okd-2024-03-10-010116   True        False         False      2d11h
insights                                   4.15.0-0.okd-2024-03-10-010116   True        False         False      40d
kube-apiserver                             4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
kube-controller-manager                    4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
kube-scheduler                             4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
kube-storage-version-migrator              4.15.0-0.okd-2024-03-10-010116   True        False         False      27h
machine-api                                4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
machine-approver                           4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
machine-config                             4.14.0-0.okd-2024-01-26-175629   True        True          True       27h     Unable to apply 4.15.0-0.okd-2024-03-10-010116: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool infra is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)]]
marketplace                                4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
monitoring                                 4.15.0-0.okd-2024-03-10-010116   True        False         False      27h
network                                    4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
node-tuning                                4.15.0-0.okd-2024-03-10-010116   True        False         False      27h
openshift-apiserver                        4.15.0-0.okd-2024-03-10-010116   True        False         False      37m
openshift-controller-manager               4.15.0-0.okd-2024-03-10-010116   True        False         False      2d11h
openshift-samples                          4.15.0-0.okd-2024-03-10-010116   True        False         False      2d11h
operator-lifecycle-manager                 4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
operator-lifecycle-manager-catalog         4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
operator-lifecycle-manager-packageserver   4.15.0-0.okd-2024-03-10-010116   True        False         False      40d
service-ca                                 4.15.0-0.okd-2024-03-10-010116   True        False         False      47d
storage                                    4.15.0-0.okd-2024-03-10-010116   True        False         False      47d

Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2024-07-01T00:12:11Z
  Generation:          1
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             0d23cbeb-28af-4e38-aae2-bb62d5a52858
  Resource Version:  27496717
  UID:               05ae4ecf-092c-43de-919d-66a13ab07d6f
Spec:
Status:
  Conditions:
    Last Transition Time:  2024-08-16T18:00:18Z
    Message:               Working towards 4.15.0-0.okd-2024-03-10-010116
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2024-08-16T18:33:49Z
    Message:               Unable to apply 4.15.0-0.okd-2024-03-10-010116: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool infra is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)]]
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2024-08-16T17:32:11Z
    Message:               Cluster has deployed [{operator 4.14.0-0.okd-2024-01-26-175629}]
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2024-08-16T18:03:50Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Infra:   pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node worker0.bomar.bme.lab is reporting: \"command \\\"/usr/bin/rpm -qf /etc/audit/rules.d/mco-audit-quiet-containers.rules\\\" returned with unexpected error: error: file /etc/audit/rules.d/mco-audit-quiet-containers.rules: Permission denied\\n: exit status 1\""
    Master:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node master0.bomar.bme.lab is reporting: \"command \\\"/usr/bin/rpm -qf /etc/audit/rules.d/mco-audit-quiet-containers.rules\\\" returned with unexpected error: error: file /etc/audit/rules.d/mco-audit-quiet-containers.rules: Permission denied\\n: exit status 1\""
    Worker:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node worker3.bomar.bme.lab is reporting: \"command \\\"/usr/bin/rpm -qf /etc/audit/rules.d/mco-audit-quiet-containers.rules\\\" returned with unexpected error: error: file /etc/audit/rules.d/mco-audit-quiet-containers.rules: Permission denied\\n: exit status 1\""
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
    Group:
    Name:      openshift-nutanix-infra
    Resource:  namespaces
    Group:
    Name:      openshift-cloud-platform-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.14.0-0.okd-2024-01-26-175629
Events:       <none>
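
For readers triaging a similar stall, the operator table above can be filtered mechanically. The following is only a minimal sketch: it assumes the table has been saved as plain text in the same column layout, and the function name `lagging_operators` and the shortened sample are illustrative, not part of any OKD tooling.

```python
def lagging_operators(table_text, target_version):
    """Return (name, version, degraded) for every operator row that is
    either not on target_version or reports DEGRADED=True."""
    rows = []
    for line in table_text.strip().splitlines():
        # Split into at most 7 fields so a MESSAGE containing spaces stays whole.
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[0] == "NAME":
            continue  # skip the header and anything that is not a table row
        name, version, _available, _progressing, degraded = fields[:5]
        if version != target_version or degraded == "True":
            rows.append((name, version, degraded == "True"))
    return rows


# Two rows taken from the table above; the MESSAGE text is shortened here.
SAMPLE = """\
NAME             VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.15.0-0.okd-2024-03-10-010116   True        False         False      11m
machine-config   4.14.0-0.okd-2024-01-26-175629   True        True          True       27h     Unable to apply 4.15.0-0.okd-2024-03-10-010116
"""

if __name__ == "__main__":
    for row in lagging_operators(SAMPLE, "4.15.0-0.okd-2024-03-10-010116"):
        print(row)
```

On the sample, only the machine-config row is reported, which matches what the full table shows.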

All MachineConfigPools (MCPs) are in Degraded status.

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-8461b7c208fd913262052c7193981d72    False     True       True       3              0                   0                     1                      28d
master   rendered-master-f43c37629059b8c27126b806bffb01cd   False     True       True       3              0                   0                     1                      47d
worker   rendered-worker-8461b7c208fd913262052c7193981d72   False     True       True       1              0                   0                     1                      47d

NAME                    STATUS   ROLES                  AGE   VERSION
master0.bomar.bme.lab   Ready    control-plane,master   47d   v1.27.9+e36e183
master1.bomar.bme.lab   Ready    control-plane,master   47d   v1.27.9+e36e183
master2.bomar.bme.lab   Ready    control-plane,master   47d   v1.27.9+e36e183
worker0.bomar.bme.lab   Ready    infra                  40d   v1.27.9+e36e183
worker1.bomar.bme.lab   Ready    infra                  40d   v1.27.9+e36e183
worker2.bomar.bme.lab   Ready    infra                  40d   v1.27.9+e36e183
worker3.bomar.bme.lab   Ready    worker                 40d   v1.27.9+e36e183
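
The pool table above can be summarized the same way. This is a rough sketch that assumes the pool listing is available as plain text with the header row shown above; `degraded_pools` is a hypothetical helper name, not an existing command or API.

```python
def degraded_pools(table_text):
    """Map each pool whose DEGRADED column is True to its DEGRADEDMACHINECOUNT."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        # Every column in this table is a single token, so a plain split lines up
        # with the header names.
        row = dict(zip(header, line.split()))
        if row.get("DEGRADED") == "True":
            result[row["NAME"]] = int(row["DEGRADEDMACHINECOUNT"])
    return result


# The three pool rows reported in this issue.
MCP_TABLE = """\
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-8461b7c208fd913262052c7193981d72    False     True       True       3              0                   0                     1                      28d
master   rendered-master-f43c37629059b8c27126b806bffb01cd   False     True       True       3              0                   0                     1                      47d
worker   rendered-worker-8461b7c208fd913262052c7193981d72   False     True       True       1              0                   0                     1                      47d
"""

if __name__ == "__main__":
    print(degraded_pools(MCP_TABLE))
```

Each of the three pools reports exactly one degraded machine, consistent with the single node named per pool in the Extension messages.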

Log bundle

The must-gather is too big to attach (~40 MB).

melledouwsma commented 3 months ago

Can you check #1928? This seems to be a duplicate of that discussion.

ObieBent commented 3 months ago

Great, I've applied the workaround, and MCO doesn't complain anymore. Thanks :)