anik120 opened this issue 2 years ago
Should we convert this to a discussion?
@timflannagan yea, it'd be ideal to have this as a discussion, and I can move it to a discussion if we're confident it'll get the same attention as it would if it were an issue (i.e. if we have plans to actively start pulling up the discussion board during issue triage/working group meetings, or any other plans we might have for it).
@anik120 Coming back to this: I've been thinking about this some more, and created https://hackmd.io/lUnrQHaKTsCLZ6j3Q52hSQ# as a result.
Looking good @timflannagan
Move this back to the backlog as it's not required in the immediate term.
This issue has become stale because it has been open 60 days with no activity. The maintainers of this repo will remove this label during issue triage, or it will be removed automatically after an update. Adding the `lifecycle/frozen` label will cause this issue to ignore lifecycle events.
When a `BundleDeployment` is created, the status conditions can be parsed to answer the question "has my bundle installed successfully, and if not, what has gone wrong during installation?". However, if an object from the `Bundle` runs into an issue at a later point in time, e.g. if the deployment deployed as part of the `Bundle` has its pod evicted when the node it was running on gets killed and the pod is unschedulable on any remaining node, the `BundleDeployment` object does not have any indication of the ongoing situation that could be used as an alert for the cluster admin.

Reproducer:
Create a cluster with the following config and install Rukpak in it:
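A kind config along these lines gives one control-plane node and two workers (the exact config used originally may differ). Create the cluster with `kind create cluster --config kind-config.yaml`, then install rukpak and its cert-manager dependency per the project's README:

```yaml
# kind-config.yaml: two workers that can be drained later, while the
# control-plane node stays available for recovery
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```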
So the cluster will look like:
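The topology can be confirmed before moving on:

```sh
# should list one control-plane node and two worker nodes, all Ready
kubectl get nodes
```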
Create a `BundleDeployment` and wait for it to install successfully. Once installed, start draining the worker nodes and disable scheduling on them:
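The exact bundle from the original reproducer isn't shown here; a `BundleDeployment` along these lines is enough to follow along. The field layout reflects the rukpak v1alpha1 API around the time of this issue and may differ in newer releases, and the image ref is a placeholder for the combo operator bundle:

```yaml
apiVersion: core.rukpak.io/v1alpha1
kind: BundleDeployment
metadata:
  name: combo
spec:
  provisionerClassName: core-rukpak-io-plain
  template:
    metadata:
      labels:
        app: combo
    spec:
      provisionerClassName: core-rukpak-io-plain
      source:
        type: image
        image:
          ref: <combo-operator-bundle-image>  # placeholder: any plain bundle reproduces the behavior
```

Once the `BundleDeployment` reports a successful install, drain the workers. With the default kind cluster name the worker nodes are `kind-worker` and `kind-worker2`, and `kubectl drain` also cordons each node, which disables further scheduling on it:

```sh
kubectl drain kind-worker --ignore-daemonsets
kubectl drain kind-worker2 --ignore-daemonsets
```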
At this point, both worker nodes are unschedulable, so the rukpak system deployments, along with supporting deployments (like the cert-manager deployments), need to be edited to make them schedulable on the master node to get rukpak up and running again:
To do that, edit those pods' specs to add the following toleration:
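A toleration along these lines works; note that the control-plane taint key depends on the Kubernetes version (`node-role.kubernetes.io/master` on older clusters, `node-role.kubernetes.io/control-plane` on newer ones):

```yaml
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
```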
This makes the rukpak system available again, while the combo operator deployment remains unavailable:
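One way to observe this (namespaces depend on how rukpak and the bundle were installed):

```sh
# rukpak and cert-manager deployments report available replicas again,
# while the combo operator deployment shows 0 available
kubectl get deployments -A
```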
However, the `BundleDeployment` still shows `InstallationSucceeded`, since the install was successful the first time around.

How probes can help:
If the `BundleDeployment` controller is aware of the answer to "what is the indication that an object installed from the `Bundle` is in the desired state?", the controller can use probes to alert when a previously successfully installed object has deviated from the desired state. An example of a probe for the deployment in this example could be "is the `.status.conditions` entry of `type: Available` `true` or `false`?", and if the value is `false`, the `BundleDeployment` InstallState can proactively be changed to `InstallationBroken` (instead of inaccurately continuing to report `InstallationSucceeded`).
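For reference, the signal such a probe would key off is the same condition that is already visible from the CLI today (deployment name and namespace below are placeholders):

```sh
# "False" here is the state that could flip the BundleDeployment's
# InstallState from InstallationSucceeded to InstallationBroken
kubectl get deployment <combo-operator-deployment> -n <namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
```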