redhat-openshift-ecosystem / community-operators-prod

community-operators metadata backing OpenShift OperatorHub
Apache License 2.0

Operator keeps flipping between Succeeded and Installing due to an error with `codeflare-operator-manager` #4572

Closed donovat closed 2 weeks ago

donovat commented 3 months ago

I have installed CodeFlare Operator version 1.4.1 on two different OpenShift clusters. In both cases the operator reports Succeeded and then flips back to Installing. The status reason it reports is: installing: waiting for deployment codeflare-operator-manager to become ready: deployment "codeflare-operator-manager" not available: Deployment does not have minimum availability.

The codeflare-operator-manager logs show the following:

2024-05-21T11:01:57Z INFO setup setting up health endpoints
2024-05-21T11:01:57Z INFO setup setting up RayCluster controller
2024-05-21T11:01:57Z INFO We detected being on OpenShift!
2024-05-21T11:01:57Z INFO setup starting manager
2024-05-21T11:01:57Z INFO controller-runtime.metrics Starting metrics server
2024-05-21T11:01:57Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-05-21T11:01:57Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-21T11:01:57Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-05-21T11:01:57Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-21T11:01:57Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-21T11:01:57Z INFO cert-rotation starting cert rotator controller
2024-05-21T11:01:57Z INFO Starting Controller {"controller": "cert-rotator"}
2024-05-21T11:01:57Z INFO cert-rotation no cert refresh needed
2024-05-21T11:01:57Z INFO cert-rotation certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-21T11:01:57Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-21T11:01:57Z INFO cert-rotation no cert refresh needed
2024-05-21T11:01:57Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"codeflare-operator-validating-webhook-configuration\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-05-21T11:01:57Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-21T11:01:57Z INFO cert-rotation no cert refresh needed
2024-05-21T11:01:57Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"codeflare-operator-validating-webhook-configuration\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-21T11:01:57Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration", "error": "MutatingWebhookConfiguration.admissionregistration.k8s.io \"codeflare-operator-mutating-webhook-configuration\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/opt/app-root/src/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-21T11:01:58Z INFO setup Waiting for certificate generation to complete
2024-05-21T11:01:59Z INFO cert-rotation CA certs are injected to webhooks
2024-05-21T11:01:59Z INFO setup Certs ready
2024-05-21T11:01:59Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "ray.io/v1, Kind=RayCluster", "path": "/mutate-ray-io-v1-raycluster"}
2024-05-21T11:01:59Z INFO controller-runtime.webhook Starting webhook server
2024-05-21T11:01:59Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-ray-io-v1-raycluster"}
2024-05-21T11:01:59Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "ray.io/v1, Kind=RayCluster", "path": "/validate-ray-io-v1-raycluster"}
2024-05-21T11:01:59Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-ray-io-v1-raycluster"}
2024-05-21T11:01:59Z INFO controller-runtime.certwatcher Updated current TLS certificate
2024-05-21T11:01:59Z INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
2024-05-21T11:01:59Z INFO controller-runtime.certwatcher Starting certificate watcher
2024-05-21T11:01:59Z INFO Starting EventSource {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "source": "kind source: *v1.RayCluster"}
2024-05-21T11:01:59Z INFO Starting Controller {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster"}
2024-05-21T11:01:59Z ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: ray.io/v1: the server could not find the requested resource"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1

The error `failed to get API group resources: unable to retrieve the complete list of server APIs: ray.io/v1: the server could not find the requested resource` keeps repeating, and the pod goes into CrashLoopBackOff and restarts. I have compared the Ray CRD on these clusters with the one on another, working cluster and can see no difference, although the working cluster runs an older version of CodeFlare.

Any ideas?
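One thing that might be worth checking, going only by the `ray.io/v1` message in the log, is whether the RayCluster CRD installed on the affected clusters actually serves a `v1` version. This is a guess at the cause, not something confirmed in this thread. The sketch below reads a CRD exported as JSON and lists the versions marked as served. The helper names and the sample CRD are hypothetical and exist only for illustration; the `spec.versions[].name` and `spec.versions[].served` fields are part of the standard `apiextensions.k8s.io/v1` CRD schema.

```python
import json


def served_versions(crd: dict) -> list:
    """Return the names of the versions a CRD marks as served."""
    versions = crd.get("spec", {}).get("versions", [])
    return [v["name"] for v in versions if v.get("served", False)]


def serves_version(crd: dict, version: str) -> bool:
    """Report whether the CRD serves the given version, e.g. "v1"."""
    return version in served_versions(crd)


# Hypothetical CRD export that defines only v1alpha1 (illustration only,
# not taken from the clusters discussed in this issue).
sample_crd = json.loads("""
{
  "apiVersion": "apiextensions.k8s.io/v1",
  "kind": "CustomResourceDefinition",
  "metadata": {"name": "rayclusters.ray.io"},
  "spec": {
    "group": "ray.io",
    "versions": [
      {"name": "v1alpha1", "served": true, "storage": true}
    ]
  }
}
""")

print(served_versions(sample_crd))
print(serves_version(sample_crd, "v1"))
```

To run it against a real cluster, the JSON could come from something like `oc get crd rayclusters.ray.io -o json`, assuming the CRD carries that name on the cluster. If `v1` is missing from the served versions, a controller watching `ray.io/v1` RayCluster objects would be expected to fail in the way the log shows.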

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.

donovat commented 2 months ago

This issue is still active and should remain open. It is a pity that no one from the team has been able to look at it or suggest what could be causing the problem.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 30 days since being marked as stale.