nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

nerc-ocp-infra clustersecretstore is offline #603

Closed (larsks closed this issue 3 weeks ago)

larsks commented 3 weeks ago

It looks like the nerc-cluster-secrets ClusterSecretStore on nerc-ocp-infra is offline:

NAME                   AGE    STATUS                  CAPABILITIES   READY
nerc-cluster-secrets   501d   InvalidProviderConfig   ReadWrite      False

This means nerc-ocp-infra will not receive updates to existing secrets and cannot retrieve new ones.
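
A minimal sketch for spotting this state from a script, assuming the default table output shown above, where READY is the last column. The function name `unready_stores` is hypothetical and not part of any existing tooling:

```shell
# Print the name of every store whose READY column (the last one) is not "True".
# Reads the table output of `kubectl get clustersecretstore` on stdin.
unready_stores() {
  # NR > 1 skips the header row; $NF is the last whitespace-separated field.
  awk 'NR > 1 && $NF != "True" { print $1 }'
}

# Usage (assumes kubectl is pointed at the nerc-ocp-infra cluster):
#   kubectl get clustersecretstore | unready_stores
```

An empty result means every store reports READY as True.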

larsks commented 3 weeks ago

This was apparently part of a more general problem: vault was failing to authenticate service accounts from the nerc-ocp-infra cluster, and the vault backup jobs were failing as well:

$ k -n vault get pod |grep backup
NAME                                        READY   STATUS      RESTARTS        AGE
backup-vault-run-26rrl-pod-4xn2b            0/3     Error       0               42h
backup-vault-run-j2j9k-pod-xdw7n            0/3     Error       0               6h50m
backup-vault-run-lb26j-pod-fvtt4            0/3     Error       0               30h
backup-vault-run-lt2pd-pod-bbzw8            0/3     Error       0               18h

I think something must have gone wrong when I ran the configure-vault job earlier this week to activate a new service account token for the hypershift cluster.

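I wasn't able to identify a root cause, but the fix was simply to re-run the configure-vault job. A finished Job can't be restarted in place, so for the record, that means exporting it without its generated fields, deleting it, and re-creating it: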

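# Export the existing Job, dropping its status and the auto-generated
# annotations, selector, and controller-uid label.
kubectl get job configure-vault -o yaml |
  yq '
    del(.status)|
    del(.metadata.annotations)|
    del(.spec.selector)|
    del(.spec.template.metadata.labels."controller-uid")
  ' > job.yaml
# Delete the old Job, then re-create it from the cleaned manifest.
kubectl delete job configure-vault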
kubectl create -f job.yaml

larsks commented 3 weeks ago

The ClusterSecretStore is now healthy:

$ k get clustersecretstore
NAME                   AGE    STATUS   CAPABILITIES   READY
nerc-cluster-secrets   502d   Valid    ReadWrite      True
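
Rather than polling by hand, the same check could be scripted. This is only a sketch: it assumes the ClusterSecretStore exposes a `Ready` condition that `kubectl wait` can watch, and that the current kubectl context and namespace are the ones used for the configure-vault job. The function name `verify_vault_fix` and the 300s default timeout are hypothetical:

```shell
# Wait for the re-created job to finish, then for the store to report Ready.
# Uses the current kubectl context and namespace.
verify_vault_fix() {
  wait_timeout="${1:-300s}"   # hypothetical default; pass another value as $1
  kubectl wait --for=condition=complete job/configure-vault \
      --timeout="$wait_timeout" &&
    kubectl wait --for=condition=Ready \
      clustersecretstore/nerc-cluster-secrets --timeout="$wait_timeout"
}

# Usage:
#   verify_vault_fix 120s && echo "store is healthy"
```

If the job fails instead of completing, the first wait simply runs until the timeout and the function returns non-zero, so the store check is skipped.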