nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

nerc-ocp-test Cluster operator network is degraded #815

Closed: computate closed this issue 1 day ago

computate commented 1 week ago

Motivation

The network cluster operator in the nerc-ocp-test cluster is in a Degraded state.

Completion Criteria

Make sure all the cluster operators are Available in the nerc-ocp-test cluster.
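One way to check this criterion from the command line is to filter the output of oc get clusteroperators for operators that are not Available or that are Degraded. This is a minimal sketch that assumes the default table layout (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE) with a populated VERSION column; the function name non_available_operators is hypothetical.

```shell
# Hypothetical helper: print the names of cluster operators that are not
# Available or that are Degraded.
# Reads the default table output of "oc get clusteroperators" on stdin.
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE.
# Assumes VERSION is populated on every row; otherwise the
# whitespace-split fields shift and the column numbers below are wrong.
non_available_operators() {
  # NR > 1 skips the header row; $3 is AVAILABLE, $5 is DEGRADED.
  awk 'NR > 1 && ($3 != "True" || $5 == "True") { print $1 }'
}
```

Usage: oc get clusteroperators | non_available_operators. An empty result means every operator reports AVAILABLE=True and none reports DEGRADED=True.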

Description

Completion dates

Desired - 2024-12-04

computate commented 1 week ago

See the state of the nerc-ocp-test cluster here.

computate commented 1 week ago

The gatekeeper-controller-manager Deployment pods in the gatekeeper-system namespace are failing with an ImagePullBackOff: Back-off pulling image "openpolicyagent/gatekeeper:v3.17.1" error.
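Pods stuck like this can be found cluster-wide by filtering the STATUS column of oc get pods -A. This is a sketch assuming the default table layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name image_pull_failures is hypothetical.

```shell
# Hypothetical helper: list namespace/pod for pods whose STATUS column shows
# an image pull failure. Reads "oc get pods -A" table output on stdin.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
image_pull_failures() {
  # The suffix match also catches init-container states such as
  # "Init:ImagePullBackOff".
  awk 'NR > 1 && $4 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 "/" $2 }'
}
```

Usage: oc get pods -A | image_pull_failures. Running oc describe pod on one of the listed pods shows the underlying pull error in its Events section.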

computate commented 1 week ago

The issue is Docker Hub rate limiting these image pulls:

Failed to pull image "openpolicyagent/gatekeeper:v3.17.1": reading manifest v3.17.1 in docker.io/openpolicyagent/gatekeeper: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
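To see how much of the anonymous limit is left, Docker's documentation has described requesting a token for the ratelimitpreview/test repository and sending a HEAD request to its manifest, then reading the ratelimit-limit and ratelimit-remaining response headers. The sketch below follows that approach; it assumes curl is installed and that the token endpoint returns compact JSON, and both function names are hypothetical. Run it from a cluster node's network if you want the count for the cluster's egress IP, since limits are tracked per source address for anonymous pulls.

```shell
# Hypothetical helper: print the HTTP response headers of a HEAD request
# for the ratelimitpreview/test manifest, which carry the rate-limit values.
dockerhub_ratelimit_headers() {
  local token
  # Fetch an anonymous pull token; the sed assumes compact JSON output.
  token="$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
    | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')"
  curl -s --head -H "Authorization: Bearer $token" \
    https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest
}

# Hypothetical helper: extract the remaining-pull count from headers such as
# "ratelimit-remaining: 76;w=21600" read on stdin (header name is matched
# case-insensitively, and stray spaces or CRs around the number are removed).
pulls_remaining() {
  awk -F'[:;]' 'tolower($1) == "ratelimit-remaining" { gsub(/[ \r]/, "", $2); print $2 }'
}
```

Usage: dockerhub_ratelimit_headers | pulls_remaining.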

computate commented 1 week ago

@larsks or @jtriley is there a way to configure a default Docker image pull secret for the cluster using someone's Docker credentials, or should I temporarily add mine to fix this issue? I'm open to ideas on best practices for this operator image on Docker Hub.

larsks commented 1 week ago

@computate you can absolutely set a default pull secret for the cluster. However, I think best practice is for each project to set up their own default pull secret in their project namespace, rather than relying on the cluster default.
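A per-namespace default pull secret can be set up with oc create secret docker-registry followed by oc secrets link --for=pull, which adds the secret to a service account's image pull secrets. This is a minimal sketch: the function name, the secret name dockerhub-pull, the DOCKERHUB_USER and DOCKERHUB_TOKEN variables, and the registry value docker.io are all assumptions. The Gatekeeper pods may not run as the default service account; oc get deployment gatekeeper-controller-manager -n gatekeeper-system -o jsonpath='{.spec.template.spec.serviceAccountName}' shows which one they use.

```shell
# Hypothetical helper: create a Docker Hub pull secret in one namespace and
# link it to a service account so pods running as that account can use it.
# Assumes DOCKERHUB_USER and DOCKERHUB_TOKEN hold Docker Hub credentials.
add_namespace_pull_secret() {
  local ns="$1"                       # target namespace, e.g. gatekeeper-system
  local sa="${2:-default}"            # service account the pods run as
  local secret="${3:-dockerhub-pull}" # name for the new secret (hypothetical)

  oc create secret docker-registry "$secret" \
    --docker-server=docker.io \
    --docker-username="$DOCKERHUB_USER" \
    --docker-password="$DOCKERHUB_TOKEN" \
    -n "$ns" || return 1

  # --for=pull adds the secret to the service account's imagePullSecrets.
  oc secrets link "$sa" "$secret" --for=pull -n "$ns"
}
```

Existing pods keep the pull secrets they were admitted with, so after linking, a restart such as oc rollout restart deployment/gatekeeper-controller-manager -n gatekeeper-system is needed for new pods to pick up the secret.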

larsks commented 1 week ago

Updating the cluster default pull secret: https://docs.openshift.com/container-platform/4.17/openshift_images/managing_images/using-image-pull-secrets.html#images-update-global-pull-secret_using-image-pull-secrets
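The procedure on that page can be sketched as a shell function: export the cluster-wide pull-secret from the openshift-config namespace, add a registry login to it with oc registry login, and write it back with oc set data. The oc steps follow the linked documentation; the function name and its arguments are hypothetical, and running it requires permission to edit secrets in openshift-config. The temporary file holds credentials, so it is removed whether or not the steps succeed.

```shell
# Hypothetical helper: add credentials for one registry to the cluster-wide
# pull secret (secret/pull-secret in the openshift-config namespace).
update_global_pull_secret() {
  local registry="$1" user="$2" password="$3"
  local tmp rc
  tmp="$(mktemp)"

  # 1. Export the current pull secret, 2. merge in the new login,
  # 3. write the merged file back to the cluster.
  if oc get secret/pull-secret -n openshift-config \
       --template='{{index .data ".dockerconfigjson" | base64decode}}' > "$tmp" &&
     oc registry login --registry="$registry" \
       --auth-basic="$user:$password" --to="$tmp" &&
     oc set data secret/pull-secret -n openshift-config \
       --from-file=.dockerconfigjson="$tmp"; then
    rc=0
  else
    rc=1
  fi

  rm -f "$tmp"  # the file contains registry credentials
  return "$rc"
}
```

Usage: update_global_pull_secret docker.io "$DOCKERHUB_USER" "$DOCKERHUB_TOKEN".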

computate commented 1 week ago

@larsks do you know who uses the Gatekeeper application in the test cluster?

larsks commented 6 days ago

@computate we use Gatekeeper to manage policies for the rhods-notebooks namespace on the production cluster. @IsaiahStapleton may have used it on the test cluster in order to validate policies before deploying them in production.

computate commented 1 day ago

@larsks thanks for your help resolving these issues on the test cluster.

$ k get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.28   True        False         81d     Cluster version is 4.15.28