Closed dgrove-oss closed 6 months ago
rebased to resolve conflict in Dockerfile
@dgrove-oss AppWrapper instance creation is failing with the error below:
Error "failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource" for field "undefined".
I made an additional adjustment to the startup logic so that the AppWrapper webhooks are registered as soon as the certificates are ready, provided the operator config enables AppWrappers.
However, the AppWrapper CRD is still installed unconditionally. If AppWrapper is disabled in the config, this will result in the user getting a cryptic error like:
Error from server (InternalError): error when creating "../appwrapper/samples/wrapped-pod.yaml": Internal error occurred: failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource
when they create or edit an AppWrapper.
If AppWrappers are enabled in the config, then you should see the following behavior (if Kueue is already installed on your cluster, only step 3 below is relevant).
make deploy ENV=e2e
(trimming noise from cert rotation). AppWrappers are enabled, but Kueue is not installed in the cluster:
2024-05-06T14:44:36Z INFO setup Build info {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:44:36Z INFO setup setting up health endpoints
2024-05-06T14:44:36Z INFO setup setting up RayCluster controller
2024-05-06T14:44:36Z INFO We detected being on Vanilla Kubernetes!
2024-05-06T14:44:36Z INFO setup setting up AppWrapper components
2024-05-06T14:44:36Z INFO setup Workload API not available; setting up waiter for Workload API availability
2024-05-06T14:44:36Z INFO setup starting manager
2024-05-06T14:44:36Z INFO controller-runtime.metrics Starting metrics server
2024-05-06T14:44:36Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-05-06T14:44:36Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:44:36Z INFO setup API workloads.kueue.x-k8s.io not available, setting up retry watcher
2024-05-06T14:44:36Z INFO setup API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:44:36Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:44:38Z INFO setup Setting up AppWrapper webhook
2024-05-06T14:44:38Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z INFO controller-runtime.webhook Starting webhook server
2024-05-06T14:44:38Z INFO controller-runtime.certwatcher Updated current TLS certificate
2024-05-06T14:44:38Z INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
2024-05-06T14:44:38Z INFO controller-runtime.certwatcher Starting certificate watcher
2024-05-06T14:47:06Z INFO admission Applying defaults {"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "53545468-844c-43a0-8bc3-3649a124da80", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
2024-05-06T14:47:06Z INFO admission Validating create {"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "35624b2e-ebb7-43a1-8989-dc02e826908f", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
make kueue-e2e
to install Kueue; the codeflare operator should restart:
2024-05-06T14:51:39Z INFO setup API workloads.kueue.x-k8s.io installed, invoking deferred action
2024-05-06T14:51:39Z INFO setup Workload API now available; triggering controller restart
...
2024-05-06T14:51:39Z INFO Wait completed, proceeding to shutdown the manager
2024-05-06T14:52:03Z INFO setup Build info {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:52:03Z INFO setup setting up health endpoints
2024-05-06T14:52:03Z INFO setup setting up RayCluster controller
2024-05-06T14:52:03Z INFO We detected being on Vanilla Kubernetes!
2024-05-06T14:52:03Z INFO setup setting up AppWrapper components
2024-05-06T14:52:03Z INFO setup Workload API available; enabling AppWrappers
2024-05-06T14:52:03Z INFO setup Waiting for certificate generation to complete
2024-05-06T14:52:03Z INFO setup starting manager
2024-05-06T14:52:03Z INFO controller-runtime.metrics Starting metrics server
2024-05-06T14:52:03Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:52:03Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-05-06T14:52:03Z INFO setup API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:52:03Z INFO cert-rotation starting cert rotator controller
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: v1.Secret"}
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: unstructured.Unstructured"}
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: unstructured.Unstructured"}
2024-05-06T14:52:03Z INFO Starting Controller {"controller": "cert-rotator"}
2024-05-06T14:52:03Z INFO cert-rotation no cert refresh needed
2024-05-06T14:52:03Z INFO cert-rotation certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-06T14:52:03Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:52:03Z INFO cert-rotation no cert refresh needed
2024-05-06T14:52:03Z INFO cert-rotation Ensuring CA cert {"name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-06T14:52:03Z INFO cert-rotation Ensuring CA cert {"name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration", "name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"}
2024-05-06T14:52:05Z INFO cert-rotation CA certs are injected to webhooks
2024-05-06T14:52:05Z INFO setup Setting up AppWrapper webhook
2024-05-06T14:52:05Z INFO setup Setting up AppWrapper controller
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta1.Workload"}
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Starting webhook server
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1.Pod"}
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.certwatcher Updated current TLS certificate
2024-05-06T14:52:05Z INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
2024-05-06T14:52:05Z INFO controller-runtime.certwatcher Starting certificate watcher
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
To document the expectation, if AppWrappers are disabled in the config your log should look like this:
2024-05-06T15:06:36Z INFO setup Build info {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T15:06:36Z INFO setup setting up health endpoints
2024-05-06T15:06:36Z INFO setup setting up RayCluster controller
2024-05-06T15:06:36Z INFO We detected being on Vanilla Kubernetes!
2024-05-06T15:06:36Z INFO setup setting up AppWrapper components
2024-05-06T15:06:36Z INFO setup AppWrappers are disabled by operator configuration
2024-05-06T15:06:36Z INFO setup starting manager
...
I made further adjustments. Now, if AppWrappers are completely disabled by the config, we set up a webhook that returns an error whenever an AppWrapper is created:
Error from server (Forbidden): error when creating "../appwrapper/samples/wrapped-job.yaml": admission webhook "vappwrapper.kb.io" denied the request: AppWrappers disabled by CodeFlare operator configuration
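The idea of a webhook that exists only to reject requests can be illustrated with a small standalone sketch. `disabledValidator` and its method signature are hypothetical simplifications; the real webhook implements controller-runtime's validator interface, which takes different arguments.

```go
package main

import (
	"errors"
	"fmt"
)

// errAppWrappersDisabled carries the message users see in the denial.
var errAppWrappersDisabled = errors.New("AppWrappers disabled by CodeFlare operator configuration")

// disabledValidator is a hypothetical stand-in for the validating webhook
// that is registered when AppWrappers are disabled in the operator config.
type disabledValidator struct{}

// ValidateCreate rejects every create request, regardless of the object.
func (disabledValidator) ValidateCreate(name string) error {
	return errAppWrappersDisabled
}

func main() {
	var v disabledValidator
	if err := v.ValidateCreate("sample-job"); err != nil {
		fmt.Println("denied the request:", err)
	}
}
```

Compared with leaving the webhook endpoint unregistered, this gives users a message that names the actual cause.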
I've ported the e2e tests from #491 to this PR as well now.
rebased and resolved merge conflicts yet again.
RayCluster creation is failing with these changes on an OpenShift cluster, with the error below:
ERROR Failed to update NetworkPolicy {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "RayCluster": {"name":"mnist","namespace":"test-ns-rayupgrade"}, "namespace": "test-ns-rayupgrade", "name": "mnist", "reconcileID": "27f0cfbb-7ef8-43b7-b1ad-bea5e49825d5", "error": "networkpolicies.networking.k8s.io \"mnist-head\" is forbidden: unable to create new content in namespace test-ns-rayupgrade because it is being terminated"}
github.com/project-codeflare/codeflare-operator/pkg/controllers.(*RayClusterReconciler).Reconcile
	/workspace/pkg/controllers/raycluster_controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227
2024-05-15T05:39:08Z ERROR Failed to update NetworkPolicy {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "RayCluster": {"name":"mnist","namespace":"test-ns-rayupgrade"}, "namespace": "test-ns-rayupgrade", "name": "mnist", "reconcileID": "27f0cfbb-7ef8-43b7-b1ad-bea5e49825d5", "error": "networkpolicies.networking.k8s.io \"mnist-workers\" is forbidden: unable to create new content in namespace test-ns-rayupgrade because it is being terminated"}
github.com/project-codeflare/codeflare-operator/pkg/controllers.(*RayClusterReconciler).Reconcile
	/workspace/pkg/controllers/raycluster_controller.go:272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227
@Srihari1192 I don't think that error is either related to this PR or affecting RayCluster creation, is it? It looks like the namespace where the RayCluster was created is being terminated, and the operator does not yet handle that case gracefully.
Yeah, it looks like an issue with some missing cluster cert configuration when we deploy the CodeFlare operator manually.
The error also says: TLS handshake error from 10.128.0.17:43198: remote error: tls: bad certificate
@dgrove-oss #541 has been merged, you'll need to rebase this one last time 🥲.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astefanutti
Replaces #491.
This assumes/includes #541.