nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

ODF operator on nerc-ocp-infra is degraded #688

Closed larsks closed 1 month ago

larsks commented 3 months ago

RH support case: https://access.redhat.com/support/cases/#/case/03908442

Thorsten and Chris were experiencing issues with the acm-metrics-backing-store. They deleted the pods associated with this backing store, and the pods failed to come back. Upon investigation, the odf-operator-controller-manager pod is in a failed state:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6
NAME                                               READY   STATUS                       RESTARTS   AGE
odf-operator-controller-manager-5d9ccf4488-w2jz6   1/2     CreateContainerConfigError   0          6d16h

Inspecting the container statuses, we see:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6 -o yaml | yq .status.containerStatuses
[
  {
    "containerID": "cri-o://6998dfb1f50bac0704f946db9e234b9236246b0945f63c1613e626622b9de813",
    "image": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:77df668a9591bbaae675d0553f8dca5423c0f257317bc08fe821d965f44ed019",
    "imageID": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:0bf40df05a3599b6ef8706e78bb1914b9f988946543a685449110aaf8b59e8bc",
    "lastState": {},
    "name": "kube-rbac-proxy",
    "ready": true,
    "restartCount": 0,
    "started": true,
    "state": {
      "running": {
        "startedAt": "2024-08-12T23:08:20Z"
      }
    }
  },
  {
    "image": "registry.redhat.io/odf4/odf-rhel9-operator@sha256:b569fbd91f664e952e646d940dd85727db8568658950cb33619a469737d1bbef",
    "imageID": "",
    "lastState": {},
    "name": "manager",
    "ready": false,
    "restartCount": 0,
    "started": false,
    "state": {
      "waiting": {
        "message": "configmap \"odf-operator-manager-config\" not found",
        "reason": "CreateContainerConfigError"
      }
    }
  }
]

And indeed, the odf-operator-manager-config ConfigMap does not exist.
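
The same check can be scripted when more than one pod needs inspecting. This is a minimal sketch, assuming the pod is dumped with `-o json` rather than `-o yaml`; the function name `waiting_containers` and the trimmed sample are illustrative and were not part of the original investigation.

```python
import json


def waiting_containers(pod):
    """Return (name, reason, message) for each container stuck in a waiting state."""
    stuck = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            stuck.append(
                (status["name"], waiting.get("reason", ""), waiting.get("message", ""))
            )
    return stuck


# Trimmed sample shaped like the containerStatuses shown above; a stand-in for
# the output of `kubectl get pod <name> -o json`.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "kube-rbac-proxy", "ready": true,
       "state": {"running": {"startedAt": "2024-08-12T23:08:20Z"}}},
      {"name": "manager", "ready": false,
       "state": {"waiting": {
         "message": "configmap \\"odf-operator-manager-config\\" not found",
         "reason": "CreateContainerConfigError"}}}
    ]
  }
}
""")

if __name__ == "__main__":
    for name, reason, message in waiting_containers(SAMPLE_POD):
        print(f"{name}: {reason}: {message}")
```

Only the `manager` container is reported, since `kube-rbac-proxy` is running normally.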

schwesig commented 3 months ago

/CC @computate @schwesig

larsks commented 3 months ago

It looks like the ODF operator is stuck installing:

$ k get csv odf-operator.v4.15.5-rhodf
NAME                         DISPLAY                     VERSION        REPLACES                     PHASE
odf-operator.v4.15.5-rhodf   OpenShift Data Foundation   4.15.5-rhodf   odf-operator.v4.15.4-rhodf   Installing
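
To spot stuck installs across a whole namespace rather than one CSV at a time, the PHASE column can be read from `.status.phase`. This is a minimal sketch, assuming the input is the list returned by `kubectl get csv -n openshift-storage -o json` (no CSV name given, so the result has an `items` array); `unsettled_csvs` is a hypothetical helper name, and the second sample entry is invented for contrast.

```python
import json


def unsettled_csvs(csv_list):
    """Return (name, phase) for every ClusterServiceVersion not in the Succeeded phase."""
    pending = []
    for item in csv_list.get("items", []):
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Succeeded":
            pending.append((item["metadata"]["name"], phase))
    return pending


# Sample list: the first entry mirrors the output above, the second is an
# invented healthy CSV used only to show that it gets filtered out.
SAMPLE_CSVS = json.loads("""
{
  "items": [
    {"metadata": {"name": "odf-operator.v4.15.5-rhodf"},
     "status": {"phase": "Installing"}},
    {"metadata": {"name": "example-operator.v1.0.0"},
     "status": {"phase": "Succeeded"}}
  ]
}
""")

if __name__ == "__main__":
    for name, phase in unsettled_csvs(SAMPLE_CSVS):
        print(f"{name}: {phase}")
```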

larsks commented 3 months ago

My theory is that we can grab the missing ConfigMap from the production cluster, where it has this data:

apiVersion: v1
data:
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  CSIADDONS_SUBSCRIPTION_CHANNEL: stable-4.15
  CSIADDONS_SUBSCRIPTION_NAME: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_PACKAGE: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_STARTINGCSV: odf-csi-addons-operator.v4.15.5-rhodf
  IBM_SUBSCRIPTION_CATALOGSOURCE: certified-operators
  IBM_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  IBM_SUBSCRIPTION_CHANNEL: stable-v1.4
  IBM_SUBSCRIPTION_NAME: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_PACKAGE: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_STARTINGCSV: ibm-storage-odf-operator.v1.4.1
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  NOOBAA_SUBSCRIPTION_CHANNEL: stable-4.15
  NOOBAA_SUBSCRIPTION_NAME: mcg-operator
  NOOBAA_SUBSCRIPTION_PACKAGE: mcg-operator
  NOOBAA_SUBSCRIPTION_STARTINGCSV: mcg-operator.v4.15.5-rhodf
  OCS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  OCS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  OCS_SUBSCRIPTION_CHANNEL: stable-4.15
  OCS_SUBSCRIPTION_NAME: ocs-operator
  OCS_SUBSCRIPTION_PACKAGE: ocs-operator
  OCS_SUBSCRIPTION_STARTINGCSV: ocs-operator.v4.15.5-rhodf
  controller_manager_config.yaml: |
    apiVersion: controller-runtime.sigs.k8s.io/v1alpha1
    kind: ControllerManagerConfig
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: 127.0.0.1:8080
    leaderElection:
      leaderElect: true
      resourceName: 4fd470de.openshift.io
kind: ConfigMap
metadata:
  labels:
    olm.managed: "true"
    operators.coreos.com/odf-operator.openshift-storage: ""
  name: odf-operator-manager-config
  namespace: openshift-storage

It looks like we will also need the 4fd470de.openshift.io configmap.
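
If the copy-from-production idea is tried, the exported object would need its cluster-specific metadata removed before being applied to another cluster. This is a minimal sketch, assuming the ConfigMap is exported with `-o json`; the list of fields to drop is an assumption based on what the API server normally populates, and `portable_manifest` is a hypothetical helper name. Whether re-creating an OLM-managed ConfigMap by hand is safe at all is still an open question at this point.

```python
import copy
import json

# Metadata the API server fills in for the source cluster's object (assumed
# list, adjust as needed). ownerReferences is dropped because it points at an
# owner UID that would not exist in the target cluster.
SERVER_FIELDS = (
    "creationTimestamp",
    "generation",
    "managedFields",
    "ownerReferences",
    "resourceVersion",
    "selfLink",
    "uid",
)


def portable_manifest(obj):
    """Return a copy of a Kubernetes object without server-populated fields."""
    clean = copy.deepcopy(obj)
    metadata = clean.get("metadata", {})
    for field in SERVER_FIELDS:
        metadata.pop(field, None)
    clean.pop("status", None)
    return clean


# Abbreviated sample; the resourceVersion, uid and creationTimestamp values
# are placeholders, not values from either cluster.
SAMPLE_CONFIGMAP = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
        "name": "odf-operator-manager-config",
        "namespace": "openshift-storage",
        "labels": {"olm.managed": "true"},
        "resourceVersion": "12345",
        "uid": "00000000-0000-0000-0000-000000000000",
        "creationTimestamp": "2024-01-01T00:00:00Z",
    },
    "data": {"NOOBAA_SUBSCRIPTION_CHANNEL": "stable-4.15"},
}

if __name__ == "__main__":
    print(json.dumps(portable_manifest(SAMPLE_CONFIGMAP), indent=2))
```

The printed JSON could then be reviewed and piped into `kubectl apply -f -` against the target cluster.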

larsks commented 3 months ago

@schwesig is going to open a customer support case and ask (a) if they can help figure out how things got into this state in the first place, and (b) if the suggestion in my previous comment seems reasonable.

schwesig commented 3 months ago

The earlier problem, https://access.redhat.com/support/cases/#/case/03861871, was more or less the trigger for digging deeper, which led to https://access.redhat.com/support/cases/#/case/03908442

schwesig commented 2 months ago

The ODF operator update seems to have succeeded now (after the maintenance restart on Sept 5th). The NooBaa acm-metrics-backing-store is still causing issues.

Update on the RH support ticket: the problem (the ODF update failing) does not seem to be a well-known one.

schwesig commented 1 month ago

Icebox until this is solved: https://github.com/nerc-project/operations/issues/745

schwesig commented 1 month ago

The degradation part is solved. We are now focused on the NooBaa and scaling-down problem. RH support also wants us to open a new ticket because this problem is solved, so I am closing this. If we come back to this problem after solving the others, we can reopen it or open a new one.