kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Discussion: custom ansible strategy for rolling update of nodes #10497

Open VannTen opened 11 months ago

VannTen commented 11 months ago

What would you like to be added:

A custom ansible strategy plugin, based on the host_pinned strategy, which would be used in the node kubelet upgrade play (and possibly other plays dealing with all the nodes). Described more precisely in ansible/ansible#81736.

Why is this needed:

  1. The linear strategy waits for all hosts to finish the current task before moving on. Unless I'm mistaken, kubelet upgrades are independent between nodes and don't need to wait on each other, so we're losing time busy-waiting.
  2. Using serial gives a batch upgrade rather than a rolling upgrade, even if we switched the current play to host_pinned (host_pinned only applies within the current batch as defined by serial; see the sketch after this list). A true rolling upgrade would instead start the play on another node as soon as any node has completed it.
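
For reference, here is a minimal sketch of how the two settings interact (not the actual Kubespray play; the host group, batch size, and task are made up):

```yaml
# Hypothetical upgrade play, for illustration only.
# host_pinned lets each host advance through the task list independently,
# but serial still fences the play into batches: the next batch starts
# only once every host in the current batch has finished the whole play.
- name: Upgrade kubelet on nodes
  hosts: kube_node
  serial: "20%"            # batch boundary -- this is what blocks
  strategy: host_pinned    # per-host progress, scoped to the batch above
  tasks:
    - name: Drain / upgrade kubelet / uncordon (real roles elided)
      ansible.builtin.debug:
        msg: "upgrade tasks would run here"
```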

Consider the following scenarios (which are not hypothetical; we have clusters doing exactly this):

Scenario A: We have some pods in the cluster with a long start time (15-30 min), which are constrained (with labels) to a particular set of nodes S. These pods have a PodDisruptionBudget to avoid losing the service (notably during cluster upgrades). Other pods have more typical startup times (<10s).

Once the first or second batch of nodes is upgraded, some of the pods with long start times are at the minimum count allowed by their PodDisruptionBudget. This means that when we try to upgrade a node in S in a later batch, one hosting some of those pods, the drain blocks for a long time waiting for the other pods to start before the node can safely be drained (which is good). However, all the other nodes in that batch have essentially finished their upgrades, and we wait for nothing.

Scenario B (worse): Two or more nodes in S are in the same batch. The first drains successfully, but the second does not (because the PodDisruptionBudget is now at the minimum acceptable number of pods). This results in a stuck upgrade: the first node is waiting on the second to complete the task, but if it weren't, it could complete its upgrade and become schedulable again, letting the cluster place new pods and make room for the second node to drain. -> This would be solved simply by changing the strategy to host_pinned, IMO.
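
To make the scenario concrete, here is a hypothetical PodDisruptionBudget (the name, label, and numbers are invented) under which a second concurrent drain in S is refused until the evicted pods are running again, which can take the full 15-30 min startup time:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: slow-start-pdb        # hypothetical name
spec:
  minAvailable: 4             # with 5 replicas, only 1 pod may be disrupted
  selector:
    matchLabels:
      app: slow-start         # hypothetical label of the long-startup pods on S
```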

Point 2 is, in my opinion, the more critical one for Kubespray performance in scenarios like those described above, but it implies point 1. I raised this issue on the Ansible GitHub, the devel mailing list, and Matrix, but I didn't get many responses besides the automated issue closure.


I would rather have this in Ansible itself and use it in Kubespray. However, if upstream is not interested, what would you think of integrating it in Kubespray? Is the maintenance worth the (presumed; I haven't tested this concretely) performance uplift?

(I can implement this myself, either by copying the free strategy with some tweaks or by starting from scratch.)

VannTen commented 9 months ago

So, I thought of something likely to get a faster ROI: instead of trying to retrofit an Ansible strategy with the "slot" concept, I'd use the host_pinned strategy coupled with Kubernetes Leases acting as "slot reservations". This has the advantage that it easily scales to a "slot-per-group" concept (which would natively support #10591) by leveraging group_vars.
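
A minimal sketch of what such a "slot reservation" Lease could look like (the names are assumptions; none of this exists in Kubespray yet). The idea: create a fixed pool of Lease objects, have each node acquire a free one by writing its own name into holderIdentity before draining, renew it while upgrading, and clear it after uncordoning:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kubespray-upgrade-slot-0   # hypothetical; one Lease per upgrade slot
  namespace: kube-system
spec:
  holderIdentity: worker-3         # node currently occupying the slot
  leaseDurationSeconds: 300        # holder must renew while its upgrade runs
```

A slot-per-group variant would only need a different Lease name prefix (or count) per group, taken from group_vars.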

Opinions welcome!

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

VannTen commented 6 months ago

/remove-lifecycle stale
/lifecycle frozen