cgwalters opened this issue 4 years ago
Also strongly related to this is https://github.com/openshift/machine-config-operator/pull/2035 - if anyone wants to e.g. avoid multiple upgrade disruptions by scaling up new nodes, today those nodes start at the old config and so will shortly get upgraded and rebooted. I think other kube distros tend to do upgrades by just creating a new node at the new config and deleting the old node, so they're less subject to this (but that model can't easily apply on bare metal). If we had that PR we could at least have a phase like:
In combination with the above we could allow admins to choose between in-place updates and the burst upgrade strategy at least in cloud.
I will spawn this as a separate issue but openshift should bias the scheduler to prefer nodes with capacity that are “up to date” over nodes that might have more capacity but are “not up to date”, because a not-up-to-date node will eventually need to be drained. That implies there should be a label or other marker on nodes that by convention indicates a node that is “not up to date and likely to be drained in the near future” (which the MCO would manage), and the scheduler should avoid that without completely ruining other scheduling criteria. This is a soft criterion because it is better to disrupt a workload twice than to let the workload dip below the preferred availability of the cluster.
There's a spectrum here - some organizations might want to use this capability to guide node upgrade ordering, e.g. updating nodes hosting less critical services first.
I think though as long as the "upgrade ordering controller" ensures that at least one control plane machine or one worker is upgradable, then the MCO can generally work fine. We would also have a requirement that for major version upgrades, the control plane is upgraded first. (It seems simpler to apply the constraint across the board that both a control plane machine and a worker are always upgradable.)
The precise definition of "upgradable" today comes down to this: https://github.com/openshift/machine-config-operator/blob/ab324326d38747a5a0aface2ba33b066ea4009bf/pkg/controller/node/node_controller.go#L834
There's ongoing work/design on having the MCO actually manage its `Upgradeable` flag - we should set that to false if we detect the control plane or worker pool drifting too far, which would help here.
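As a rough illustration (not the MCO's actual implementation): the flag in question is the `Upgradeable` status condition on the MCO's ClusterOperator, and computing it could look roughly like the sketch below, where `poolDriftedTooFar` is a made-up predicate standing in for the drift detection.

```go
// Illustrative only: compute an Upgradeable condition for the MCO's
// ClusterOperator status. poolDriftedTooFar is a hypothetical predicate
// standing in for "the control plane or worker pool has drifted too far".
package upgradeable

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func upgradeableCondition(poolDriftedTooFar bool) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorUpgradeable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if poolDriftedTooFar {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "MachineConfigPoolsNotCaughtUp"
		cond.Message = "one or more machine config pools have not rolled out their current configuration"
	}
	return cond
}
```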
Another important detail here is not just the ordering but controlling when any updates are applied. There are a few approaches to that today:
Another big hammer is to create lots of machineconfigpools, perhaps even to the point of a pool per node. This is a pretty extreme scenario for the MCO; I think it would probably work, but it would create a lot of overhead.
Better approaches:
* A `machineconfig.openshift.io/upgrade-ordering` annotation, with a value like `none` to mean "don't touch this"
* [kubernetes/enhancements#1411](https://github.com/kubernetes/enhancements/pull/1411) ?
One possible implementation using the proposed lease might look like the sketch below.
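Roughly, and just as an illustration: the node controller (or any other component that wants to block it) could check a per-node `Lease` in `coordination.k8s.io/v1` before doing anything disruptive. The namespace and function names here are illustrative, not an existing API.

```go
// Illustrative sketch, not real MCO code. Assumes a per-node Lease object in
// coordination.k8s.io/v1, named after the node, in a hypothetical
// "machine-config-operator" namespace; the holder identity records which
// component currently intends to (or refuses to allow) disruption of that node.
package lease

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryAcquireNodeLease returns true if the caller identified by holder now owns
// the disruption lease for nodeName. If another component holds it, the node
// is simply skipped this reconcile round.
func tryAcquireNodeLease(ctx context.Context, client kubernetes.Interface, nodeName, holder string) (bool, error) {
	leases := client.CoordinationV1().Leases("machine-config-operator") // assumed namespace
	existing, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Nobody has claimed disruption of this node yet; create the lease.
		_, err := leases.Create(ctx, &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: nodeName},
			Spec:       coordinationv1.LeaseSpec{HolderIdentity: &holder},
		}, metav1.CreateOptions{})
		return err == nil, err
	}
	if err != nil {
		return false, err
	}
	if existing.Spec.HolderIdentity != nil && *existing.Spec.HolderIdentity != holder {
		// Someone else (an admin, another operator) holds the lease: don't touch the node.
		return false, nil
	}
	return true, nil
}
```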
To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of these other components.
> To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of these other components.
This sounds just great - I've been looking at the Lease type for a while now, and this seems related and super useful.
It also sounds similar to airlock: https://github.com/poseidon/fleetlock#manual-intervention
There's clearly overlap between "upgrade weighting" (this initial proposal) and "upgrade blocking/leases". We can clearly stretch "weighting" into including "blocking" per above. It doesn't solve interaction with other components like leases would (and we have that problem in general with machineAPI).
In all of this remember though it's not just about the single reboot - in the general case I think we should handle "accumulated disruption" where pods are rescheduled onto nodes that are then immediately chosen for update.
Well...OTOH leases would solve at least one "extreme" case where pods are pinned to nodes and we want basically per-node locking and updating at most one controlplane/worker at a time; in that case the admin or controller puts a lease on every other node.
Hmm so I guess I am leaning towards leases.
I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.
Unless I'm missing something, this is an anti-pattern and we should move in a different direction.
> I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.
> Unless I'm missing something, this is an anti-pattern and we should move in a different direction.
This is a story that came up in 4.0 planning. We need a way for users to upgrade X% of their nodes to a given release and let that soak. This is somewhat a different concern than the original issue here.
I'm a fan of informing the scheduler of the kubelet version and having it prioritize the newest ones. It seems like it would be easy to do: "do scheduling as normal to find suitable nodes, sort by kubelet version, pick the highest in that list". Of course, if the scheduler only finds the first fit rather than the best fit, then this would be a much larger change (though based on how preemption works, it may still be possible in the former case).
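A rough sketch of that "sort suitable nodes by kubelet version" idea (purely illustrative, not a real scheduler plugin):

```go
// Illustrative only: given nodes that already fit a pod, prefer the ones
// running the newest kubelet so workloads land on already-updated nodes.
package ordering

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
	utilversion "k8s.io/apimachinery/pkg/util/version"
)

// sortByKubeletVersion orders feasible nodes newest-kubelet-first; a
// "pick the first suitable node" pass then prefers up-to-date nodes.
func sortByKubeletVersion(nodes []corev1.Node) {
	sort.SliceStable(nodes, func(i, j int) bool {
		vi, erri := utilversion.ParseGeneric(nodes[i].Status.NodeInfo.KubeletVersion)
		vj, errj := utilversion.ParseGeneric(nodes[j].Status.NodeInfo.KubeletVersion)
		if erri != nil || errj != nil {
			return false // unparsable versions keep their relative order
		}
		return vj.LessThan(vi) // newest first
	})
}
```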
> It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated.
Nothing in this proposal requires that all nodes have a priority. Unlabeled nodes could e.g. be `0`; then an admin could use negative values to be after the default, and positive ones before. For example, adding a negative priority to a few critical nodes hosting the postgres database.
It also doesn't require planning out how pods move - that's a strongly related but orthogonal thing.
That said, this is motivated by one request at an extreme end that wants exactly what you're saying: the admin chooses which nodes to update. In that case there isn't a concern about interaction with e.g. PDBs, because the pods are pinned to nodes.
I agree that marking the node as no longer schedulable is a nice way of identifying a node that is preparing to undergo maintenance. I had been looking at a situation where I actually want a pod to schedule to a node; I just don't want the node upgraded. So the ability to cordon a node from an upgrade pool, but not from scheduling, is the semantic I was trying to think through.
Another observation:
Should we add a rollout strategy concept on a machine config pool? One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
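As a sketch of what such a topology-based rollout strategy could do under the hood (the strategy field itself doesn't exist today; the well-known `topology.kubernetes.io/zone` label is assumed for grouping):

```go
// Illustrative only: group a pool's candidate nodes by zone so an update
// controller could finish one zone entirely before starting the next.
package ordering

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

func groupByZone(nodes []corev1.Node) (zones []string, byZone map[string][]corev1.Node) {
	byZone = map[string][]corev1.Node{}
	for _, n := range nodes {
		zone := n.Labels["topology.kubernetes.io/zone"]
		byZone[zone] = append(byZone[zone], n)
	}
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	return zones, byZone
}
```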
Hopefully this comment relates to this particular issue. During a cluster upgrade we (UXD) would like to be able to surface in the console the difference between a node that is 'ready and up to date' vs. a node that is 'ready and waiting to update'. Is there something we could do so that users can quickly distinguish between the two? Perhaps an annotation - waiting to update - or something like that?
> One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
This would increase your exposure to zone outages, wouldn't it? What if you had a workload that happened to be incompatible with the incoming MachineConfig? You could have PDBs that protect it from excessive degradation during a randomized rollout, but if it gets squeezed into the old-MachineConfig zone A, zones B and C completely update, and then zone A has an outage, you'd be immediately thrust into an outage for that workload.
Initial implementation in https://github.com/openshift/machine-config-operator/pull/2162 and a separate PR in https://github.com/openshift/machine-config-operator/pull/2163
This could help prevent nodes from upgrading to a kernel that is not supported by an out-of-tree driver. Only these special-resource nodes can run the workload; draining them could kill the workload, since the "general" CPU nodes cannot handle it. Customers could then try to build the drivers on the updated nodes without disrupting the special nodes. The question is: how many update cycles would this annotated node "survive"?
@openshift-bot: Closing this issue.
/lifecycle frozen
This keeps coming up.
Today, any admin that wants to fully control node upgrade ordering can:
First, pause the pool controlling those nodes (e.g. `worker`). The node controller will then not apply `desiredConfig` annotations on the targeted nodes. However, crucially, the render controller will still generate new rendered machineconfigs.
So for example, when you upgrade the cluster to pick up a new kernel security update, the `worker` pool will update to a new `spec.configuration`. (Note that if you've paused the worker pool but not the control plane, then the control plane will have updated.)
Now, if you want to manually target a node for update, it should work to `oc edit node/<somenode>` and update the `machineconfiguration.openshift.io/desiredConfig` annotation to the current value of `spec.configuration` in the worker pool. The machine-config-daemon (daemonset) running on that node will then react to that annotation and apply it (drain + reboot as needed), the same way it would if the node controller had changed it.
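For completeness, the same manual targeting could be scripted; a minimal client-go sketch (nothing MCO-specific, it just patches the annotation named above, with the rendered config name passed in by hand):

```go
// Illustrative only: pin a node to a specific rendered MachineConfig by
// setting the desiredConfig annotation, the same edit `oc edit node` would make.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName, rendered := os.Args[1], os.Args[2] // e.g. worker-0 rendered-worker-<hash>

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Merge-patch just the annotation; the machine-config-daemon on the node
	// reacts to it the same way as if the node controller had set it.
	patch := fmt.Sprintf(`{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":%q}}}`, rendered)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.MergePatchType, []byte(patch), metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```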
Today the MCO makes no attempt to apply any ordering to which nodes it updates from the candidates. One problem we're thinking about (particularly in bare metal scenarios where there might be a lot of pods on a node, possibly pods that are expensive to reschedule, like CNV) is that it's quite possible workloads are disrupted multiple times for a single OS upgrade.
When we go to drain a node, its pods will be rescheduled across the remaining nodes...and then we will upgrade one of those, quite possibly moving one of the workload pods again etc.
One idea here is to add the minimal hooks such that a separate controller could influence this today.
If for example we supported a label like `machineconfig.openshift.io/upgrade-weight=42` and the node controller picked the highest-weight node, then the separate controller could also e.g. mark $number nodes which are next in the upgrade ordering as unschedulable, ensuring that the drain from the current node doesn't land on them. Without excess capacity, or changing the scheduler to more strongly prefer packing nodes, it seems hard to avoid multiple disruptions, but the label would allow this baseline integration.
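For concreteness, the node controller side of that could be as small as something like this (illustrative names only, with unlabeled nodes defaulting to weight 0 as discussed above):

```go
// Illustrative sketch of the proposed machineconfig.openshift.io/upgrade-weight
// label: among candidate nodes, pick the one with the highest weight; nodes
// without the label default to 0, and negative values sort after the default.
package ordering

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

func upgradeWeight(n corev1.Node) int {
	v, err := strconv.Atoi(n.Labels["machineconfig.openshift.io/upgrade-weight"])
	if err != nil {
		return 0 // missing or malformed label: default weight
	}
	return v
}

// pickNextCandidate returns the highest-weight node, or nil if there are no candidates.
func pickNextCandidate(candidates []corev1.Node) *corev1.Node {
	var best *corev1.Node
	for i := range candidates {
		if best == nil || upgradeWeight(candidates[i]) > upgradeWeight(*best) {
			best = &candidates[i]
		}
	}
	return best
}
```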