project-codeflare / codeflare-operator

Operator for installation and lifecycle management of CodeFlare distributed workload stack
Apache License 2.0

Add AppWrapper v1beta2 CRD and controllers to Codeflare operator #543

Closed dgrove-oss closed 6 months ago

dgrove-oss commented 7 months ago

Replaces #491.

This assumes/includes #541.

dgrove-oss commented 7 months ago

rebased to resolve conflict in Dockerfile

Srihari1192 commented 6 months ago

@dgrove-oss AppWrapper instance creation is failing with the error below:

Error "failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource" for field "undefined".

dgrove-oss commented 6 months ago

I made an additional adjustment to the startup logic so that the AppWrapper webhooks will be registered as soon as the certificates are ready if the operator config enables AppWrappers.

However, the AppWrapper CRD is still installed unconditionally. If AppWrapper is disabled in the config, this will result in the user getting a cryptic error like:

Error from server (InternalError): error when creating "../appwrapper/samples/wrapped-pod.yaml": Internal error occurred: failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource

when they create or edit an AppWrapper.
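The startup ordering described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the operator's actual code: `Config`, `AppWrapperEnabled`, `setupAppWrapperWebhooks`, and the `certsReady` channel are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Config is a hypothetical stand-in for the operator configuration.
type Config struct {
	AppWrapperEnabled bool
}

// setupAppWrapperWebhooks waits for certsReady to be closed and then calls
// register, but only when AppWrappers are enabled in cfg. It reports whether
// registration was attempted, along with any error from register.
func setupAppWrapperWebhooks(cfg Config, certsReady <-chan struct{}, register func() error) (bool, error) {
	if !cfg.AppWrapperEnabled {
		// Disabled by configuration: never wait, never register.
		return false, nil
	}
	<-certsReady // block until certificate generation has completed
	return true, register()
}

func main() {
	certsReady := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // simulate certificate generation
		close(certsReady)
	}()

	registered, err := setupAppWrapperWebhooks(
		Config{AppWrapperEnabled: true},
		certsReady,
		func() error {
			fmt.Println("Setting up AppWrapper webhook")
			return nil
		},
	)
	fmt.Println("registered:", registered, "err:", err)
}
```

Closing a channel to signal readiness lets any number of waiters proceed once the certificates exist, which is why the sketch uses a channel rather than a boolean flag.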

If AppWrappers are enabled in the config, then you should see the following behavior (if Kueue is already installed on your cluster, only step 3 below is relevant).

  1. Initial startup of make deploy ENV=e2e (trimming noise from cert rotation). AppWrappers are enabled, but Kueue is not installed in the cluster
2024-05-06T14:44:36Z    INFO    setup   Build info  {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:44:36Z    INFO    setup   setting up health endpoints
2024-05-06T14:44:36Z    INFO    setup   setting up RayCluster controller
2024-05-06T14:44:36Z    INFO    We detected being on Vanilla Kubernetes!
2024-05-06T14:44:36Z    INFO    setup   setting up AppWrapper components
2024-05-06T14:44:36Z    INFO    setup   Workload API not available; setting up waiter for Workload API availability
2024-05-06T14:44:36Z    INFO    setup   starting manager
2024-05-06T14:44:36Z    INFO    controller-runtime.metrics  Starting metrics server
2024-05-06T14:44:36Z    INFO    controller-runtime.metrics  Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-05-06T14:44:36Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:44:36Z    INFO    setup   API workloads.kueue.x-k8s.io not available, setting up retry watcher
2024-05-06T14:44:36Z    INFO    setup   API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:44:36Z    INFO    Starting workers    {"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:44:38Z    INFO    setup   Setting up AppWrapper webhook
2024-05-06T14:44:38Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z    INFO    controller-runtime.webhook  Starting webhook server
2024-05-06T14:44:38Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2024-05-06T14:44:38Z    INFO    controller-runtime.webhook  Serving webhook server  {"host": "", "port": 9443}
2024-05-06T14:44:38Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2024-05-06T14:47:06Z    INFO    admission   Applying defaults   {"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "53545468-844c-43a0-8bc3-3649a124da80", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
2024-05-06T14:47:06Z    INFO    admission   Validating create   {"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "35624b2e-ebb7-43a1-8989-dc02e826908f", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
  2. Do a make kueue-e2e to install Kueue; the codeflare operator should restart
2024-05-06T14:51:39Z    INFO    setup   API workloads.kueue.x-k8s.io installed, invoking deferred action
2024-05-06T14:51:39Z    INFO    setup   Workload API now available; triggering controller restart
...
2024-05-06T14:51:39Z    INFO    Wait completed, proceeding to shutdown the manager
  3. On restart, AppWrappers should be fully enabled

2024-05-06T14:52:03Z INFO setup Build info {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:52:03Z INFO setup setting up health endpoints
2024-05-06T14:52:03Z INFO setup setting up RayCluster controller
2024-05-06T14:52:03Z INFO We detected being on Vanilla Kubernetes!
2024-05-06T14:52:03Z INFO setup setting up AppWrapper components
2024-05-06T14:52:03Z INFO setup Workload API available; enabling AppWrappers
2024-05-06T14:52:03Z INFO setup Waiting for certificate generation to complete
2024-05-06T14:52:03Z INFO setup starting manager
2024-05-06T14:52:03Z INFO controller-runtime.metrics Starting metrics server
2024-05-06T14:52:03Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:52:03Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-05-06T14:52:03Z INFO setup API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:52:03Z INFO cert-rotation starting cert rotator controller
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: v1.Secret"}
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: unstructured.Unstructured"}
2024-05-06T14:52:03Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: unstructured.Unstructured"}
2024-05-06T14:52:03Z INFO Starting Controller {"controller": "cert-rotator"}
2024-05-06T14:52:03Z INFO cert-rotation no cert refresh needed
2024-05-06T14:52:03Z INFO cert-rotation certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-06T14:52:03Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:52:03Z INFO cert-rotation no cert refresh needed
2024-05-06T14:52:03Z INFO cert-rotation Ensuring CA cert {"name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-06T14:52:03Z INFO cert-rotation Ensuring CA cert {"name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration", "name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"}
2024-05-06T14:52:05Z INFO cert-rotation CA certs are injected to webhooks
2024-05-06T14:52:05Z INFO setup Setting up AppWrapper webhook
2024-05-06T14:52:05Z INFO setup Setting up AppWrapper controller
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta1.Workload"}
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.webhook Starting webhook server
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1beta2.AppWrapper"}
2024-05-06T14:52:05Z INFO Starting EventSource {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: v1.Pod"}
2024-05-06T14:52:05Z INFO Starting Controller {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z INFO controller-runtime.certwatcher Updated current TLS certificate
2024-05-06T14:52:05Z INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
2024-05-06T14:52:05Z INFO controller-runtime.certwatcher Starting certificate watcher
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z INFO Starting workers {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
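The "retry watcher" and "deferred action" messages in the logs above suggest a wait-then-act pattern: keep checking whether an API is installed and, once it is, run a callback (here, triggering a restart). The following is a minimal sketch of that pattern using only the standard library. The `waitForAPI` helper and its bounded-attempts polling are assumptions made for illustration; the operator's real watcher may well work differently (for example, by watching CRDs rather than polling).

```go
package main

import (
	"fmt"
	"time"
)

// waitForAPI checks available up to attempts times, sleeping interval between
// checks. As soon as available reports true it runs action once and returns
// true. It returns false if the API never showed up within the attempts.
func waitForAPI(attempts int, interval time.Duration, available func() bool, action func()) bool {
	for i := 0; i < attempts; i++ {
		if available() {
			action()
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Simulate an API that becomes available on the third check.
	checks := 0
	installed := func() bool {
		checks++
		return checks >= 3
	}

	ok := waitForAPI(10, 5*time.Millisecond, installed, func() {
		fmt.Println("Workload API now available; triggering controller restart")
	})
	fmt.Println("deferred action ran:", ok)
}
```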

dgrove-oss commented 6 months ago

To document the expectation, if AppWrappers are disabled in the config your log should look like this:

2024-05-06T15:06:36Z    INFO    setup   Build info  {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T15:06:36Z    INFO    setup   setting up health endpoints
2024-05-06T15:06:36Z    INFO    setup   setting up RayCluster controller
2024-05-06T15:06:36Z    INFO    We detected being on Vanilla Kubernetes!
2024-05-06T15:06:36Z    INFO    setup   setting up AppWrapper components
2024-05-06T15:06:36Z    INFO    setup   AppWrappers are disabled by operator configuration
2024-05-06T15:06:36Z    INFO    setup   starting manager
...

dgrove-oss commented 6 months ago

I made further adjustments. Now, if AppWrappers are completely disabled by the config, we set up a webhook that generates an error when AppWrappers are created.

Error from server (Forbidden): error when creating "../appwrapper/samples/wrapped-job.yaml": admission webhook "vappwrapper.kb.io" denied the request: AppWrappers disabled by CodeFlare operator configuration
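The idea of installing a rejecting webhook when the feature is disabled can be sketched as below. This is a simplified, standard-library-only illustration: the `AppWrapper` struct, the `validator` type, and `newValidator` are hypothetical, and the "enabled" branch performs only a placeholder check rather than the real validation. Only the error text is taken from the message above.

```go
package main

import (
	"errors"
	"fmt"
)

// errAppWrappersDisabled carries the message shown in the rejection above.
var errAppWrappersDisabled = errors.New("AppWrappers disabled by CodeFlare operator configuration")

// AppWrapper is a hypothetical, heavily reduced stand-in for the real type.
type AppWrapper struct {
	Name      string
	Namespace string
}

// validator decides whether creating the given AppWrapper is allowed.
type validator func(aw AppWrapper) error

// newValidator returns a validator that always rejects when AppWrappers are
// disabled, and otherwise applies a placeholder check on the object.
func newValidator(enabled bool) validator {
	if !enabled {
		return func(AppWrapper) error { return errAppWrappersDisabled }
	}
	return func(aw AppWrapper) error {
		if aw.Name == "" {
			return errors.New("AppWrapper name must not be empty")
		}
		return nil
	}
}

func main() {
	aw := AppWrapper{Name: "sample-job", Namespace: "default"}
	fmt.Println("disabled:", newValidator(false)(aw))
	fmt.Println("enabled: ", newValidator(true)(aw))
}
```

Rejecting in a validating webhook gives the user an explicit message at creation time instead of the opaque "could not find the requested resource" failure that occurs when no webhook is serving.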

dgrove-oss commented 6 months ago

I've ported the e2e tests from #491 to this PR as well now.

dgrove-oss commented 6 months ago

rebased and resolved merge conflicts yet again.

Srihari1192 commented 6 months ago

RayCluster creation is failing with these changes on an OpenShift cluster, with the error below:

ERROR Failed to update NetworkPolicy {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "RayCluster": {"name":"mnist","namespace":"test-ns-rayupgrade"}, "namespace": "test-ns-rayupgrade", "name": "mnist", "reconcileID": "27f0cfbb-7ef8-43b7-b1ad-bea5e49825d5", "error": "networkpolicies.networking.k8s.io \"mnist-head\" is forbidden: unable to create new content in namespace test-ns-rayupgrade because it is being terminated"}
github.com/project-codeflare/codeflare-operator/pkg/controllers.(*RayClusterReconciler).Reconcile
    /workspace/pkg/controllers/raycluster_controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227
2024-05-15T05:39:08Z ERROR Failed to update NetworkPolicy {"controller": "codeflare-raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "RayCluster": {"name":"mnist","namespace":"test-ns-rayupgrade"}, "namespace": "test-ns-rayupgrade", "name": "mnist", "reconcileID": "27f0cfbb-7ef8-43b7-b1ad-bea5e49825d5", "error": "networkpolicies.networking.k8s.io \"mnist-workers\" is forbidden: unable to create new content in namespace test-ns-rayupgrade because it is being terminated"}
github.com/project-codeflare/codeflare-operator/pkg/controllers.(*RayClusterReconciler).Reconcile
    /workspace/pkg/controllers/raycluster_controller.go:272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227

astefanutti commented 6 months ago

@Srihari1192 I don't think that error is either related to this PR or impacts the RayCluster creation, is it? It looks like the namespace where the RayCluster was created is being terminated, and the operator does not yet handle that case gracefully.
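As an illustration of what handling that case more gracefully could look like, here is a rough standard-library-only sketch that treats "namespace is being terminated" failures as non-retryable instead of logging them as errors and requeueing. Everything in it is hypothetical (`isNamespaceTerminating`, `shouldRequeue`, and the sample errors), and matching on the message text is a simplification: a real controller would more likely inspect the structured API status returned by the server.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errTerminating mimics the message from the log above. errOther is an
// unrelated, made-up failure used for contrast.
var (
	errTerminating = errors.New(`networkpolicies.networking.k8s.io "mnist-head" is forbidden: unable to create new content in namespace test-ns-rayupgrade because it is being terminated`)
	errOther       = errors.New("connection refused")
)

// isNamespaceTerminating reports whether err looks like a rejection caused
// by the target namespace being deleted. Text matching is an assumption made
// to keep this sketch free of external dependencies.
func isNamespaceTerminating(err error) bool {
	return err != nil && strings.Contains(err.Error(), "because it is being terminated")
}

// shouldRequeue decides whether a failed update is worth retrying. Retrying
// is pointless once the namespace is going away, so that case is skipped.
func shouldRequeue(err error) bool {
	if err == nil || isNamespaceTerminating(err) {
		return false
	}
	return true
}

func main() {
	fmt.Println("terminating namespace, requeue:", shouldRequeue(errTerminating))
	fmt.Println("other failure, requeue:", shouldRequeue(errOther))
}
```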

Srihari1192 commented 6 months ago

> @Srihari1192 I don't think that error is either related to this PR or impacts the RayCluster creation, is it? It looks like the namespace where the RayCluster was created is being terminated, and the operator does not yet handle that case gracefully.

Yeah, it looks like an issue with some missing cluster cert configuration when we deploy the CodeFlare operator manually. The error also says: TLS handshake error from 10.128.0.17:43198: remote error: tls: bad certificate.

astefanutti commented 6 months ago

@dgrove-oss #541 has been merged, you'll need to rebase this one last time 🥲.

astefanutti commented 6 months ago

/lgtm

astefanutti commented 6 months ago

/approve

openshift-ci[bot] commented 6 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/project-codeflare/codeflare-operator/blob/main/OWNERS)~~ [astefanutti]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment