operator-framework / rukpak

RukPak runs in a Kubernetes cluster and defines APIs for installing cloud native content
Apache License 2.0

Consider adding probes for objects unpacked and installed from Bundle, when BundleDeployment is created #419

Open anik120 opened 2 years ago

anik120 commented 2 years ago

When a BundleDeployment is created, its status conditions can be parsed to answer the question "has my bundle installed successfully, and if not, what has gone wrong during installation?".

However, if an object from the Bundle runs into an issue later on, e.g. if the Deployment installed as part of the Bundle has its pod evicted when the node it was running on is killed and the pod is unschedulable on any remaining node, the BundleDeployment object gives no indication of the "ongoing situation" that could be used to alert the cluster admin.

Reproducer:

Create a cluster with the following config and install Rukpak in it:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
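
Assuming the config above is saved as kind-config.yaml (the filename here is arbitrary), the cluster can be created with something like the following; --name rukpak makes the node names match the output below:

$ kind create cluster --name rukpak --config kind-config.yaml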

So the cluster will look like:

$ kubectl get nodes 
NAME                   STATUS   ROLES                  AGE   VERSION
rukpak-control-plane   Ready    control-plane,master   4m   v1.23.4
rukpak-worker          Ready    <none>                 4m   v1.23.4
rukpak-worker2         Ready    <none>                 4m   v1.23.4

Create a BundleDeployment and wait for it to install successfully:

$ kubectl apply -f -<<EOF
apiVersion: core.rukpak.io/v1alpha1
kind: BundleDeployment
metadata:
  name: combo
spec:
  provisionerClassName: core.rukpak.io/plain
  template:
    metadata:
      labels:
        app: combo
    spec:
      provisionerClassName: core.rukpak.io/plain
      source:
        image:
          ref: quay.io/operator-framework/combo-bundle:v0.0.1
        type: image
EOF

$ kubectl get bd 
NAME    INSTALLED BUNDLE   INSTALL STATE           AGE
combo   combo-7cdc7d7d6d   InstallationSucceeded   2m
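
Instead of polling kubectl get bd, kubectl wait can block until the install completes. This sketch assumes the BundleDeployment surfaces an Installed status condition (which the INSTALL STATE column above suggests), so treat the condition name as an assumption:

$ kubectl wait --for=condition=Installed bd/combo --timeout=120s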

Once installed, start draining the worker nodes and disable scheduling on them:

$ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name --all-namespaces | grep rukpak-worker2
rukpak-worker2         combo-operator-6469d6695d-fjswr
rukpak-worker2         kindnet-f5ddr
rukpak-worker2         kube-proxy-l6lvr
rukpak-worker2         plain-unpack-bundle-combo-7cdc7d7d6d

$ kubectl drain rukpak-worker2 --ignore-daemonsets                                                         
node/rukpak-worker2 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kindnet-f5ddr, kube-system/kube-proxy-l6lvr
evicting pod rukpak-system/plain-unpack-bundle-combo-7cdc7d7d6d
evicting pod combo/combo-operator-6469d6695d-fjswr
pod/plain-unpack-bundle-combo-7cdc7d7d6d evicted
pod/combo-operator-6469d6695d-fjswr evicted
node/rukpak-worker2 drained

$ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name --all-namespaces | grep rukpak-worker2
rukpak-worker2         kindnet-f5ddr
rukpak-worker2         kube-proxy-l6lvr

$ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name --all-namespaces         
NODE                   NAME
rukpak-worker          cert-manager-5b6d4f8d44-9pfkx
rukpak-worker          cert-manager-cainjector-747cfdfd87-v67b9
rukpak-worker          cert-manager-webhook-67cb765ff6-zxv7p
rukpak-worker          combo-operator-6469d6695d-bbpsn
rukpak-worker          crd-validation-webhook-7bddfb88c-wr9vp
.
.
.
$ kubectl drain rukpak-worker --ignore-daemonsets --delete-emptydir-data
.
.
.
$ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name --all-namespaces 
NODE                   NAME
<none>                 cert-manager-5b6d4f8d44-wkcld
<none>                 cert-manager-cainjector-747cfdfd87-jn8m9
<none>                 cert-manager-webhook-67cb765ff6-kpncb
<none>                 combo-operator-6469d6695d-n2d6f
<none>                 crd-validation-webhook-7bddfb88c-v2dg5
rukpak-control-plane   coredns-64897985d-h9sd8
rukpak-control-plane   coredns-64897985d-pxnjb
rukpak-control-plane   etcd-rukpak-control-plane
rukpak-worker2         kindnet-f5ddr
rukpak-worker          kindnet-lpbjd
rukpak-control-plane   kindnet-zgg8p
rukpak-control-plane   kube-apiserver-rukpak-control-plane
rukpak-control-plane   kube-controller-manager-rukpak-control-plane
rukpak-control-plane   kube-proxy-7x5q4
rukpak-worker          kube-proxy-fxprr
rukpak-worker2         kube-proxy-l6lvr
rukpak-control-plane   kube-scheduler-rukpak-control-plane
rukpak-control-plane   local-path-provisioner-5ddd94ff66-4d59t
<none>                 plain-provisioner-767589cb94-bqlp7
<none>                 rukpak-core-webhook-f6684794-5l9nh

At this point both worker nodes are unschedulable, so the rukpak system deployments, along with supporting deployments (like the cert-manager deployments), need to be edited to make them schedulable on the master node to get rukpak up and running again:

To do that, add the following toleration to the pod spec of each of the pending pods listed below, by editing their owning Deployments (a patch sketch follows the pod listings):

tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

$ kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-5b6d4f8d44-wkcld              0/1     Pending   0          5m
cert-manager-cainjector-747cfdfd87-jn8m9   0/1     Pending   0          5m
cert-manager-webhook-67cb765ff6-kpncb      0/1     Pending   0          5m

$  kubectl get pods -n rukpak-system 
NAME                                 READY   STATUS    RESTARTS   AGE
plain-provisioner-767589cb94-bqlp7   0/2     Pending   0          5m
rukpak-core-webhook-f6684794-5l9nh   0/1     Pending   0          5m

$ kubectl get pods -n crdvalidator-system 
NAME                                     READY   STATUS    RESTARTS   AGE
crd-validation-webhook-7bddfb88c-v2dg5   0/1     Pending   0          5m
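
One way to add the toleration without opening an editor is a JSON patch against each owning Deployment's pod template; the sketch below patches the plain-provisioner Deployment, and the same patch can be repeated for the rukpak-core-webhook, crd-validation-webhook, and cert-manager Deployments:

$ kubectl -n rukpak-system patch deployment plain-provisioner --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'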

This brings the entire system back up, while the combo operator deployment remains unavailable:

$ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name --all-namespaces                           
NODE                   NAME
rukpak-control-plane   cert-manager-5b6d4f8d44-wkcld
rukpak-control-plane   cert-manager-cainjector-747cfdfd87-jn8m9
rukpak-control-plane   cert-manager-webhook-67cb765ff6-kpncb
<none>                 combo-operator-6469d6695d-n2d6f
rukpak-control-plane   crd-validation-webhook-7bddfb88c-v2dg5
rukpak-control-plane   coredns-64897985d-h9sd8
rukpak-control-plane   coredns-64897985d-pxnjb
rukpak-control-plane   etcd-rukpak-control-plane
rukpak-worker2         kindnet-f5ddr
rukpak-worker          kindnet-lpbjd
rukpak-control-plane   kindnet-zgg8p
rukpak-control-plane   kube-apiserver-rukpak-control-plane
rukpak-control-plane   kube-controller-manager-rukpak-control-plane
rukpak-control-plane   kube-proxy-7x5q4
rukpak-worker          kube-proxy-fxprr
rukpak-worker2         kube-proxy-l6lvr
rukpak-control-plane   kube-scheduler-rukpak-control-plane
rukpak-control-plane   local-path-provisioner-5ddd94ff66-4d59t
rukpak-control-plane   plain-provisioner-767589cb94-bqlp7
<none>                 plain-unpack-bundle-combo-7cdc7d7d6d
rukpak-control-plane   rukpak-core-webhook-f6684794-5l9nh

However, the BundleDeployment still shows InstallationSucceeded, since the initial installation was successful:

$ kubectl get bd 
NAME    INSTALLED BUNDLE   INSTALL STATE           AGE
combo   combo-7cdc7d7d6d   InstallationSucceeded   70m
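
Meanwhile, the Deployment installed from the bundle reports that it is not Available, which is exactly the signal a probe could surface. For example, the following should print False while the combo-operator pod is unschedulable (deployment name inferred from the pod names above):

$ kubectl -n combo get deployment combo-operator -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'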

How probes can help:

If the BundleDeployment controller knows the answer to "what indicates that an object installed from the Bundle is in its desired state?", it can use probes to alert when a previously successfully installed object has deviated from that state. An example probe for the Deployment in this scenario could be "is the .status.conditions entry of type Available True or False?"; if the value is False, the BundleDeployment InstallState can be changed to InstallationBroken proactively (instead of inaccurately continuing to report InstallationSucceeded).
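
As a purely hypothetical sketch (none of these fields exist in the current v1alpha1 API), a probe declaration on the BundleDeployment could look something like:

apiVersion: core.rukpak.io/v1alpha1
kind: BundleDeployment
metadata:
  name: combo
spec:
  provisionerClassName: core.rukpak.io/plain
  # hypothetical field: declares how to judge the health of unpacked objects
  probes:
  - targetKind: Deployment        # which installed object to watch
    conditionType: Available      # status condition to evaluate on that object
    expectedStatus: "True"        # any other value flips INSTALL STATE to InstallationBroken
  template:
    # unchanged from the BundleDeployment example above
    ...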

timflannagan commented 2 years ago

Should we convert this to a discussion?

anik120 commented 2 years ago

@timflannagan yea it'd be ideal to have this as a discussion, and I can move this to a discussion if we're confident it'll get the same attention as it would if it were an issue. (i.e if we have plans to actively start pulling up the discussion board during issue triage/working group meetings, or any other plans we might have for it).

timflannagan commented 2 years ago

@anik120 Coming back to this: I've been thinking about this some more, and created https://hackmd.io/lUnrQHaKTsCLZ6j3Q52hSQ# as a result.

anik120 commented 2 years ago

Looking good @timflannagan

exdx commented 2 years ago

Move this back to the backlog as it's not required in the immediate term.

github-actions[bot] commented 2 years ago

This issue has become stale because it has been open 60 days with no activity. The maintainers of this repo will remove this label during issue triage or it will be removed automatically after an update. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.