rancher / system-upgrade-controller

In your Kubernetes, upgrading your nodes

feature: cluster health init-container similar to how prepare is leveraged for k3s-upgrade #169

Open dweomer opened 2 years ago

dweomer commented 2 years ago

Is your feature request related to a problem? Please describe.
Some upgrade use-cases require that the cluster "be healthy" before incurring the disruption of a node upgrade. It would be nice to configure a Plan such that some settling has occurred before it continues with the next node. This could be achieved by some sort of health measurement, e.g. ensuring that all ReplicaSets and DaemonSets have a minimum number of pods running.

Describe the solution you'd like
A parameter or two on the Plan spec indicating that a health measurement must pass before node upgrade(s) commence, and which pre-canned strategy to use for making that determination. Maybe the presence of a strategy choice other than "none" would be enough (so, one parameter).
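To make the shape concrete, here is a minimal sketch of what such a Plan could look like. The `healthCheck` field and its `strategy`/`timeout` values are entirely hypothetical (nothing like this exists in the Plan CRD today); the surrounding fields are the usual Plan spec:

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  serviceAccountName: system-upgrade
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: "true"
  # Hypothetical: block the next node's upgrade until the chosen
  # health strategy passes. A strategy other than "none" would enable the check.
  healthCheck:
    strategy: workloads-ready   # e.g. all ReplicaSets/DaemonSets at desired counts
    timeout: 30m
  upgrade:
    image: rancher/k3s-upgrade
```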

Describe alternatives you've considered
Relying on the eviction algorithm that respects pod disruption budgets (i.e. NOT setting .spec.drain.disableEviction) will likely not be adequate for all upgrade needs, because eviction can hang indefinitely in resource-constrained clusters. Because of this we must assume that some disruptions can and will happen when upgrade plans are applied. Is this enough to warrant new logic in the controller? :shrug:
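For reference, the eviction behavior lives in the Plan's drain section, roughly as below (field layout from the Plan CRD as I recall it, so verify against your SUC version):

```yaml
spec:
  drain:
    force: true
    # When true, drain deletes pods directly instead of going through the
    # eviction API, so PodDisruptionBudgets are bypassed and the drain
    # cannot hang on an unsatisfiable budget.
    disableEviction: true
```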

Additional context

psy-q commented 1 year ago

We could use a lightweight version of this where we can at least specify a delay between node upgrades so that we avoid having multiple nodes rebooting before a StatefulSet with 3 pods is ready again. Is there a delay option already that we missed, e.g. 30 minutes between node upgrades?

The issue we have is that this workload is tightly coupled to specific nodes, so SUC just goes ahead and reboots one after the other. Even if rescheduling the pods to another node would satisfy their PDB, that won't happen, because they have to be scheduled on the exact same node again.

As it takes 15-20 minutes for a pod to become ready and reconnect to its cluster friends, SUC has cheerfully rebooted all three nodes by that time, destroying the application's clustering mode. It can't deal with more than one cluster member being unavailable at any one time.
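Until something like this exists, one workaround is to combine `concurrency: 1` with a `prepare` container that blocks until the StatefulSet is healthy again, since `prepare` runs before the node is cordoned and drained. A rough sketch, assuming an image that ships kubectl (e.g. bitnami/kubectl), a service account allowed to read the StatefulSet, and placeholder names throughout:

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: node-reboot
  namespace: system-upgrade
spec:
  concurrency: 1                     # never touch more than one node at a time
  serviceAccountName: system-upgrade
  nodeSelector:
    matchLabels:
      app-cluster-member: "true"     # placeholder label for the pinned nodes
  prepare:
    image: bitnami/kubectl:latest    # any image with kubectl; entrypoint is kubectl
    args:
      - rollout
      - status
      - statefulset/my-clustered-app # placeholder workload name
      - --namespace=my-namespace
      - --watch
      - --timeout=30m                # fail the job rather than wait forever
  upgrade:
    image: my-reboot-image           # placeholder: whatever performs the reboot
```

With `concurrency: 1`, the next node's `prepare` step won't finish (and the node won't be drained) until the pod from the previously rebooted node has rejoined, which gives the 15-20 minute settling window without hardcoding a fixed delay.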

dweomer commented 1 year ago

> We could use a lightweight version of this where we can at least specify a delay between node upgrades so that we avoid having multiple nodes rebooting before a StatefulSet with 3 pods is ready again. Is there a delay option already that we missed, e.g. 30 minutes between node upgrades?
>
> The issue we have is that this workload is tightly coupled to specific nodes, so SUC just goes ahead and reboots one after the other. Even if rescheduling the pods to another node would satisfy their PDB, that won't happen, because they have to be scheduled on the exact same node again.
>
> As it takes 15-20 minutes for a pod to become ready and reconnect to its cluster friends, SUC has cheerfully rebooted all three nodes by that time, destroying the application's clustering mode. It can't deal with more than one cluster member being unavailable at any one time.

IIRC, SUC will honor an existing PDB if one exists.
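For the 3-pod StatefulSet above, that would be the standard PDB below (names are placeholders). The caveat for the node-pinned case is the one from the original issue: eviction simply blocks while the budget is unsatisfied, which is the indefinite-hang behavior rather than a real health check.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-clustered-app
  namespace: my-namespace
spec:
  minAvailable: 2              # with 3 replicas, at most one pod may be down
  selector:
    matchLabels:
      app: my-clustered-app    # placeholder label matching the StatefulSet pods
```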