cgwalters opened this issue 4 years ago
Also strongly related to this is https://github.com/openshift/machine-config-operator/pull/2035 - if anyone wants to e.g. avoid multiple upgrade disruptions by scaling up new nodes, today those nodes start at the old config and so will shortly get upgraded and rebooted. I think other kube distros tend to do upgrades by just creating a new node at the new config and deleting the old node, so they're less subject to this (but that model can't easily apply on bare metal). If we had that PR we could at least have a phase like:
In combination with the above we could allow admins to choose between in-place updates and the burst upgrade strategy at least in cloud.
I will spawn this as a separate issue but openshift should bias the scheduler to prefer nodes with capacity that are “up to date” over nodes that might have more capacity but are “not up to date”, because a not-up-to-date node will eventually need to be drained. That implies there should be a label or other marker on nodes that by convention indicates a node that is “not up to date and likely to be drained in the near future” (which the MCO would manage), and the scheduler should avoid that without completely ruining other scheduling criteria. This is a soft criterion because it is better to disrupt a workload twice than to let the workload dip below the preferred availability of the cluster.
There's a spectrum here - some organizations might want to use this capability to guide node upgrade ordering, e.g. updating nodes hosting less critical services first.
I think though as long as the "upgrade ordering controller" ensures that at least one control plane machine or one worker is upgradable, then the MCO can generally work fine. We would also have a requirement that for major version upgrades, the control plane is upgraded first. (It seems simpler to apply the constraint across the board that both a control plane machine and a worker are always upgradable.)
The precise definition of "upgradable" today comes down to this: https://github.com/openshift/machine-config-operator/blob/ab324326d38747a5a0aface2ba33b066ea4009bf/pkg/controller/node/node_controller.go#L834
There's ongoing work/design on having the MCO actually manage its `Upgradeable` flag - we should set that to false if we detect the control plane or worker pool drifting too far, which would help here.
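As a rough illustration (not the MCO's actual implementation): the flag in question is the `Upgradeable` status condition on the MCO's ClusterOperator, and computing it could look roughly like the sketch below, where `poolDriftedTooFar` is a made-up predicate standing in for the drift detection.

```go
// Illustrative only: compute an Upgradeable condition for the MCO's
// ClusterOperator status. poolDriftedTooFar is a hypothetical predicate
// standing in for "the control plane or worker pool has drifted too far".
package upgradeable

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func upgradeableCondition(poolDriftedTooFar bool) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorUpgradeable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if poolDriftedTooFar {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "MachineConfigPoolsNotCaughtUp"
		cond.Message = "one or more machine config pools have not rolled out their current configuration"
	}
	return cond
}
```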
Another important detail here is not just the ordering but controlling when any updates are applied. There are a few approaches to that today:
Another big hammer is to create lots of machineconfigpools, perhaps even to the point of a pool per node. This is a pretty extreme scenario for the MCO; I think it would probably work, but it would create a lot of overhead.
Better approaches:
* A `machineconfig.openshift.io/upgrade-ordering` annotation, with a value like `none` to mean "don't touch this"
* [kubernetes/enhancements#1411](https://github.com/kubernetes/enhancements/pull/1411) ?
One possible implementation using the proposed lease might look like the sketch below.
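Roughly, and just as an illustration: the node controller (or any other component that wants to block it) could check a per-node `Lease` in `coordination.k8s.io/v1` before doing anything disruptive. The namespace and function names here are illustrative, not an existing API.

```go
// Illustrative sketch, not real MCO code. Assumes a per-node Lease object in
// coordination.k8s.io/v1, named after the node, in a hypothetical
// "machine-config-operator" namespace; the holder identity records which
// component currently intends to (or refuses to allow) disruption of that node.
package lease

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryAcquireNodeLease returns true if the caller identified by holder now owns
// the disruption lease for nodeName. If another component holds it, the node
// is simply skipped this reconcile round.
func tryAcquireNodeLease(ctx context.Context, client kubernetes.Interface, nodeName, holder string) (bool, error) {
	leases := client.CoordinationV1().Leases("machine-config-operator") // assumed namespace
	existing, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Nobody has claimed disruption of this node yet; create the lease.
		_, err := leases.Create(ctx, &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: nodeName},
			Spec:       coordinationv1.LeaseSpec{HolderIdentity: &holder},
		}, metav1.CreateOptions{})
		return err == nil, err
	}
	if err != nil {
		return false, err
	}
	if existing.Spec.HolderIdentity != nil && *existing.Spec.HolderIdentity != holder {
		// Someone else (an admin, another operator) holds the lease: don't touch the node.
		return false, nil
	}
	return true, nil
}
```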
To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of these other components.
> To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface on top of which others may build useful components or procedures. It also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of these other components.
This sounds just great - I've been looking at the Lease type for a while now, and this seems related and super useful.
It also sounds similar to airlock: https://github.com/poseidon/fleetlock#manual-intervention
There's clearly overlap between "upgrade weighting" (this initial proposal) and "upgrade blocking/leases". We can clearly stretch "weighting" into including "blocking" per above. It doesn't solve interaction with other components like leases would (and we have that problem in general with machineAPI).
In all of this remember though it's not just about the single reboot - in the general case I think we should handle "accumulated disruption" where pods are rescheduled onto nodes that are then immediately chosen for update.
Well...OTOH leases would solve at least one "extreme" case where pods are pinned to nodes and we want basically per-node locking and updating at most one controlplane/worker at a time; in that case the admin or controller puts a lease on every other node.
Hmm so I guess I am leaning towards leases.
I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.
Unless I'm missing something, this is an anti-pattern and we should move in a different direction.
> I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention.
> Unless I'm missing something, this is an anti-pattern and we should move in a different direction.
This is a story that came up in 4.0 planning. We need a way for users to upgrade X% of their nodes to a given release and let that soak. This is somewhat a different concern than the original issue here.
I'm a fan of informing the scheduler of the kubelet version and having it prioritize the newest ones. It seems like it would be easy to do: "do scheduling as normal to find suitable nodes, sort by kubelet version, pick the highest in that list". Of course, if the scheduler only finds the first fit rather than the best fit, then this would be a much larger change (though based on how preemption works, it may still be possible in the former case).
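A rough sketch of that "sort suitable nodes by kubelet version" idea (purely illustrative, not a real scheduler plugin):

```go
// Illustrative only: given nodes that already fit a pod, prefer the ones
// running the newest kubelet so workloads land on already-updated nodes.
package ordering

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
	utilversion "k8s.io/apimachinery/pkg/util/version"
)

// sortByKubeletVersion orders feasible nodes newest-kubelet-first; a
// "pick the first suitable node" pass then prefers up-to-date nodes.
func sortByKubeletVersion(nodes []corev1.Node) {
	sort.SliceStable(nodes, func(i, j int) bool {
		vi, erri := utilversion.ParseGeneric(nodes[i].Status.NodeInfo.KubeletVersion)
		vj, errj := utilversion.ParseGeneric(nodes[j].Status.NodeInfo.KubeletVersion)
		if erri != nil || errj != nil {
			return false // unparsable versions keep their relative order
		}
		return vj.LessThan(vi) // newest first
	})
}
```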
> It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated.
Nothing in this proposal requires that all nodes have a priority. Unlabeled nodes could e.g. be `0`; then an admin could use negative values to be after the default, and positive ones before. For example, adding a negative priority to a few critical nodes hosting the postgres database.
It also doesn't require planning out how pods move - that's a strongly related but orthogonal thing.
That said, this is motivated by one request at an extreme end that wants exactly what you're saying: the admin chooses which nodes to update. In that case there isn't a concern about interaction with e.g. PDBs, because the pods are pinned to nodes.
I agree that marking the node as no longer schedulable is a nice way of identifying a node that is preparing to undergo maintenance. I had been looking at a situation where I actually want a pod to schedule to a node; I just don't want the node upgraded. So the ability to cordon a node from an upgrade pool, but not from scheduling, is the semantic I was trying to think through.
Another observation:
Should we add a rollout strategy concept on a machine config pool? One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
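As a sketch of what such a topology-based rollout strategy could do under the hood (the strategy field itself doesn't exist today; the well-known `topology.kubernetes.io/zone` label is assumed for grouping):

```go
// Illustrative only: group a pool's candidate nodes by zone so an update
// controller could finish one zone entirely before starting the next.
package ordering

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

func groupByZone(nodes []corev1.Node) (zones []string, byZone map[string][]corev1.Node) {
	byZone = map[string][]corev1.Node{}
	for _, n := range nodes {
		zone := n.Labels["topology.kubernetes.io/zone"]
		byZone[zone] = append(byZone[zone], n)
	}
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	return zones, byZone
}
```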
Hopefully this comment relates to this particular issue. During a cluster upgrade we (UXD) would like to be able to surface in the console the difference between a node that is 'ready and up to date' vs. a node that is 'ready and waiting to update'. Is there something we could do so that users can quickly distinguish between the two? Perhaps an annotation - waiting to update - or something like that?
> One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware.
This would increase your exposure to zone outages, wouldn't it? What if you had a workload that happened to be incompatible with the incoming MachineConfig? You could have PDBs that protect it from excessive degradation during a randomized rollout, but if it gets squeezed into the old-MachineConfig zone A, zones B and C completely update, and then zone A has an outage, you'd be immediately thrust into an outage for that workload.
Initial implementation in https://github.com/openshift/machine-config-operator/pull/2162 and a separate PR in https://github.com/openshift/machine-config-operator/pull/2163
This could help prevent nodes from upgrading to a kernel that is not supported by an out-of-tree driver. Only these special-resource nodes can run the workload; draining them could kill the workload, since the "general" CPU nodes cannot handle it. Customers could then try to build the drivers on the updated nodes without disrupting the special nodes. The question is: how many update cycles would this annotated node "survive"?
@openshift-bot: Closing this issue.
/lifecycle frozen
This keeps coming up.
Today, any admin that wants to fully control node upgrade ordering can:
First, pause the pool controlling those nodes (e.g. `worker`). The node controller will then not apply `desiredConfig` annotations on the targeted nodes. However, crucially, the render controller will still generate new rendered machineconfigs.
So for example, when you upgrade the cluster to pick up a new kernel security update, the `worker` pool will update to a new `spec.configuration`. (Note that if you've paused the worker pool but not the control plane, then the control plane will have updated.)
Now, if you want to manually target a node for update, it should work to `oc edit node/<somenode>` and update the `machineconfiguration.openshift.io/desiredConfig` annotation to the current value of `spec.configuration` in the worker pool. The machine-config-daemon (daemonset) running on that node will then react to that annotation and apply it (drain + reboot as needed), the same way it would if the node controller had changed it.
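For completeness, the same manual targeting could be scripted; a minimal client-go sketch (nothing MCO-specific, it just patches the annotation named above, with the rendered config name passed in by hand):

```go
// Illustrative only: pin a node to a specific rendered MachineConfig by
// setting the desiredConfig annotation, the same edit `oc edit node` would make.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName, rendered := os.Args[1], os.Args[2] // e.g. worker-0 rendered-worker-<hash>

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Merge-patch just the annotation; the machine-config-daemon on the node
	// reacts to it the same way as if the node controller had set it.
	patch := fmt.Sprintf(`{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":%q}}}`, rendered)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.MergePatchType, []byte(patch), metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```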
Today the MCO makes no attempt to apply any ordering to which nodes it updates from the candidates. One problem we're thinking about (particularly in bare metal scenarios where there might be a lot of pods on a node, possibly pods that are expensive to reschedule, like CNV) is that it's quite possible workloads are disrupted multiple times for a single OS upgrade.
When we go to drain a node, its pods will be rescheduled across the remaining nodes...and then we will upgrade one of those, quite possibly moving one of the workload pods again etc.
One idea here is to add the minimal hooks such that a separate controller could influence this today.
If for example we supported a label like `machineconfig.openshift.io/upgrade-weight=42` and the node controller picked the highest-weight node, then the separate controller could also e.g. mark $number nodes which are next in the upgrade ordering as unschedulable, ensuring that the drain from the current node doesn't land on them. Without excess capacity, or changing the scheduler to more strongly prefer packing nodes, it seems hard to avoid multiple disruptions, but the label would allow this baseline integration.
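For concreteness, the node controller side of that could be as small as something like this (illustrative names only, with unlabeled nodes defaulting to weight 0 as discussed above):

```go
// Illustrative sketch of the proposed machineconfig.openshift.io/upgrade-weight
// label: among candidate nodes, pick the one with the highest weight; nodes
// without the label default to 0, and negative values sort after the default.
package ordering

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

func upgradeWeight(n corev1.Node) int {
	v, err := strconv.Atoi(n.Labels["machineconfig.openshift.io/upgrade-weight"])
	if err != nil {
		return 0 // missing or malformed label: default weight
	}
	return v
}

// pickNextCandidate returns the highest-weight node, or nil if there are no candidates.
func pickNextCandidate(candidates []corev1.Node) *corev1.Node {
	var best *corev1.Node
	for i := range candidates {
		if best == nil || upgradeWeight(candidates[i]) > upgradeWeight(*best) {
			best = &candidates[i]
		}
	}
	return best
}
```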