nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

ODF operator on nerc-ocp-infra is degraded #688

Open larsks opened 3 weeks ago

larsks commented 3 weeks ago

RH support case: https://access.redhat.com/support/cases/#/case/03908442

Thorsten and Chris were experiencing issues with the acm-metrics-backing-store. They deleted the pods associated with this backing store, and the pods failed to come back. Upon investigation, the odf-operator-controller-manager pod is in a failed state:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6
NAME                                               READY   STATUS                       RESTARTS   AGE
odf-operator-controller-manager-5d9ccf4488-w2jz6   1/2     CreateContainerConfigError   0          6d16h

Inspecting the container statuses, we see:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6 -o yaml | yq .status.containerStatuses
[
  {
    "containerID": "cri-o://6998dfb1f50bac0704f946db9e234b9236246b0945f63c1613e626622b9de813",
    "image": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:77df668a9591bbaae675d0553f8dca5423c0f257317bc08fe821d965f44ed019",
    "imageID": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:0bf40df05a3599b6ef8706e78bb1914b9f988946543a685449110aaf8b59e8bc",
    "lastState": {},
    "name": "kube-rbac-proxy",
    "ready": true,
    "restartCount": 0,
    "started": true,
    "state": {
      "running": {
        "startedAt": "2024-08-12T23:08:20Z"
      }
    }
  },
  {
    "image": "registry.redhat.io/odf4/odf-rhel9-operator@sha256:b569fbd91f664e952e646d940dd85727db8568658950cb33619a469737d1bbef",
    "imageID": "",
    "lastState": {},
    "name": "manager",
    "ready": false,
    "restartCount": 0,
    "started": false,
    "state": {
      "waiting": {
        "message": "configmap \"odf-operator-manager-config\" not found",
        "reason": "CreateContainerConfigError"
      }
    }
  }
]

And indeed, the odf-operator-manager-config ConfigMap does not exist.

schwesig commented 3 weeks ago

/CC @computate @schwesig

larsks commented 3 weeks ago

It looks like the ODF operator is stuck installing:

$ k get csv odf-operator.v4.15.5-rhodf
NAME                         DISPLAY                     VERSION        REPLACES                     PHASE
odf-operator.v4.15.5-rhodf   OpenShift Data Foundation   4.15.5-rhodf   odf-operator.v4.15.4-rhodf   Installing
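To see why a CSV is sitting in the Installing phase, it can help to print the reason and message that OLM records in the CSV status. The following is only a sketch: the `csv_status` helper name is made up for illustration, it assumes `jq` is installed, and it assumes the CSV lives in the `openshift-storage` namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print phase, reason, and message of a ClusterServiceVersion.
# Assumes the CSV is in the openshift-storage namespace and that jq is available.
csv_status() {
  kubectl -n openshift-storage get csv "$1" -o json |
    jq -r '.status | "\(.phase)\t\(.reason)\t\(.message)"'
}

# Example invocation (not run here):
#   csv_status odf-operator.v4.15.5-rhodf
```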

larsks commented 3 weeks ago

My theory is that we can grab the missing ConfigMap from the production cluster, where it has this data:

apiVersion: v1
data:
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  CSIADDONS_SUBSCRIPTION_CHANNEL: stable-4.15
  CSIADDONS_SUBSCRIPTION_NAME: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_PACKAGE: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_STARTINGCSV: odf-csi-addons-operator.v4.15.5-rhodf
  IBM_SUBSCRIPTION_CATALOGSOURCE: certified-operators
  IBM_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  IBM_SUBSCRIPTION_CHANNEL: stable-v1.4
  IBM_SUBSCRIPTION_NAME: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_PACKAGE: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_STARTINGCSV: ibm-storage-odf-operator.v1.4.1
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  NOOBAA_SUBSCRIPTION_CHANNEL: stable-4.15
  NOOBAA_SUBSCRIPTION_NAME: mcg-operator
  NOOBAA_SUBSCRIPTION_PACKAGE: mcg-operator
  NOOBAA_SUBSCRIPTION_STARTINGCSV: mcg-operator.v4.15.5-rhodf
  OCS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  OCS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  OCS_SUBSCRIPTION_CHANNEL: stable-4.15
  OCS_SUBSCRIPTION_NAME: ocs-operator
  OCS_SUBSCRIPTION_PACKAGE: ocs-operator
  OCS_SUBSCRIPTION_STARTINGCSV: ocs-operator.v4.15.5-rhodf
  controller_manager_config.yaml: |
    apiVersion: controller-runtime.sigs.k8s.io/v1alpha1
    kind: ControllerManagerConfig
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: 127.0.0.1:8080
    leaderElection:
      leaderElect: true
      resourceName: 4fd470de.openshift.io
kind: ConfigMap
metadata:
  labels:
    olm.managed: "true"
    operators.coreos.com/odf-operator.openshift-storage: ""
  name: odf-operator-manager-config
  namespace: openshift-storage

It looks like we will also need the 4fd470de.openshift.io configmap.
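A minimal sketch of how the copy could be done, assuming `jq` is available and that both clusters are reachable as kubeconfig contexts. The function names and the context arguments are hypothetical. Server-populated metadata (including any `ownerReferences`, whose UIDs would not match on the target cluster) is stripped before the object is submitted, and `--dry-run=server` is left in so nothing is changed until the result has been reviewed.

```shell
#!/usr/bin/env bash
# Sketch only: copy odf-operator-manager-config from a healthy cluster
# to the degraded one. Context names are supplied by the caller.

# Drop server-populated metadata so the object can be created fresh.
strip_server_fields() {
  jq 'del(.metadata.uid,
          .metadata.resourceVersion,
          .metadata.creationTimestamp,
          .metadata.managedFields,
          .metadata.ownerReferences)'
}

# Usage: copy_configmap <source-context> <destination-context>
copy_configmap() {
  local src="$1" dst="$2"
  kubectl --context "$src" -n openshift-storage \
    get configmap odf-operator-manager-config -o json |
    strip_server_fields |
    kubectl --context "$dst" -n openshift-storage \
      apply --dry-run=server -f -
}
```

Removing `--dry-run=server` would perform the actual apply. Whether OLM accepts a hand-created copy of this ConfigMap is exactly the open question for support, so this is not a confirmed fix.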

larsks commented 3 weeks ago

@schwesig is going to open a customer support case and ask (a) if they can help figure out how things got into this state in the first place, and (b) if the suggestion in my previous comment seems reasonable.

schwesig commented 3 weeks ago

The earlier problem (https://access.redhat.com/support/cases/#/case/03861871) was the trigger for digging deeper, which led to https://access.redhat.com/support/cases/#/case/03908442

schwesig commented 4 days ago

The ODF operator update seems to have succeeded now (after the maintenance restart on Sept 5th). The NooBaa acm-metrics-backing-store is still causing issues.

Update on the RH support ticket: the problem (ODF update failing) does not seem to be widely known.