vatesfr / xen-orchestra

The global orchestration solution to manage and backup XCP-ng and XenServer.
https://xen-orchestra.com

Scheduled Rolling pool updates (with reports) for XCP-ng #5286

Open olivierlambert opened 3 years ago

olivierlambert commented 3 years ago

Context

When applying patches, there's often a need to reboot hosts. To avoid any service interruption, you need to migrate VMs around manually. It takes some time, but it's very safe because you know what you are doing.

However, we could be more helpful and create a mechanism to do rolling pool updates (a yum update, NOT an upgrade to a newer XCP-ng version).

Prerequisites

Before starting the automated process, we need to be sure that all VMs are agile (live-migratable between all hosts of the pool). In other words, call host.assert_can_evacuate on each host to know whether a rolling pool update is possible.
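
Here is a rough sketch of that check with the XenAPI Python bindings; the URL and credentials are placeholders, and a real implementation would go through Xen Orchestra's own connection handling rather than a raw session:

```python
# Minimal prerequisite check (sketch only): every host must be able to
# evacuate its VMs, otherwise we abort before touching anything.
import XenAPI

session = XenAPI.Session("https://pool-master.example")   # placeholder URL
session.xenapi.login_with_password("root", "password")    # placeholder creds

try:
    blockers = {}
    for host in session.xenapi.host.get_all():
        try:
            # Raises XenAPI.Failure if at least one resident VM is not agile
            session.xenapi.host.assert_can_evacuate(host)
        except XenAPI.Failure as failure:
            name = session.xenapi.host.get_name_label(host)
            blockers[name] = failure.details
    if blockers:
        raise RuntimeError("Rolling pool update not possible: %s" % blockers)
finally:
    session.xenapi.session.logout()
```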

We must also be able to detect reliably when a host is back up, which can be problematic given how poorly Xen Orchestra currently detects that hosts are down (especially the pool master).

Mechanism

If the prerequisites are OK, then we'll:

  1. Update the pool and wait for the update to finish successfully on all hosts
  2. Save the list of VMs running on the pool master, then evacuate it with host.evacuate; XAPI will handle the migrations automatically. Make sure it succeeded. Alternative: if we don't care, we can simply evacuate each host without saving the previous VM placement. That should work, but the last host will end up empty, which will unbalance the load.
  3. Reboot the master and wait for it to come back. If it's OK, migrate back the VMs evacuated in the previous step.
  4. Move on to the first slave, applying steps 2 and 3.
  5. Repeat for each slave until the whole pool is done (see the sketch after this list).
  6. Report success
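
As a rough illustration of steps 2-5, here is a minimal sketch with the XenAPI Python bindings. It assumes an already authenticated `session` (as in the prerequisite sketch above), simplifies error handling, and glosses over the fact that rebooting the master drops the XAPI connection, which a real implementation would have to re-establish:

```python
import time
import XenAPI  # assumes `session` is an authenticated XenAPI session

def wait_for_host(session, host, timeout=1800):
    """Poll until the host reports itself as enabled again after the reboot."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if session.xenapi.host.get_enabled(host):
                return
        except Exception:
            pass  # XAPI may be unreachable while the host (or master) reboots
        time.sleep(10)
    raise RuntimeError("Host did not come back in time")

def update_host(session, host):
    # Remember the current VM placement so we can restore it afterwards
    residents = [vm for vm in session.xenapi.host.get_resident_VMs(host)
                 if not session.xenapi.VM.get_is_control_domain(vm)]

    session.xenapi.host.disable(host)
    session.xenapi.host.evacuate(host)   # XAPI live-migrates the VMs away
    session.xenapi.host.reboot(host)     # the host must be disabled first
    wait_for_host(session, host)

    # Migrate the saved VMs back so the load stays balanced
    for vm in residents:
        session.xenapi.VM.pool_migrate(vm, host, {"live": "true"})

# Master first, then every slave
pool = session.xenapi.pool.get_all()[0]
master = session.xenapi.pool.get_master(pool)
slaves = [h for h in session.xenapi.host.get_all() if h != master]
for host in [master] + slaves:
    update_host(session, host)
```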

Because the process is long and asynchronous, we should probably send an email (or similar) report for the rolling update. This is also something that will be integrated into the XO task mechanism (in XO 6), which handles tasks beyond just XAPI ones.

Functional usage

In the pool view, when there are updates, we should add a "Rolling Pool Updates" button. For XO 6, we might integrate this more deeply in the UI, probably offering the choice only when we detect it's possible.

Misc/ideas

Planned rolling pool updates (a la "backup job" with report)

It would be great to be able to plan this rolling update so it's executed at a given time (e.g. at night during the weekend), so an admin doesn't have to "witness" what's going on.
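
As a toy illustration of the scheduling idea (standard library only); `rolling_pool_update` and `send_report` below are hypothetical placeholders for the update loop and the report mechanism discussed in this issue:

```python
import datetime
import time

def next_saturday_at(hour=2):
    """Return the next Saturday at the given hour (Monday=0 ... Saturday=5)."""
    now = datetime.datetime.now()
    days_ahead = (5 - now.weekday()) % 7
    run = (now + datetime.timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    if run <= now:
        run += datetime.timedelta(days=7)
    return run

run_at = next_saturday_at()
time.sleep((run_at - datetime.datetime.now()).total_seconds())
try:
    rolling_pool_update()                        # hypothetical entry point
    send_report("Rolling pool update: success")  # hypothetical report hook
except Exception as exc:
    send_report("Rolling pool update failed: %s" % exc)
```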

Non-agile scenario capabilities

If some VMs can't be evacuated, we might, in the future, offer some solutions (like migration with storage, or simply shutting VMs down and starting them back up once the host is up to date).
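
A minimal sketch of the "shut down, update, start again" fallback, assuming an authenticated XenAPI `session` and the `update_host` routine sketched earlier in this issue:

```python
def update_host_with_non_agile(session, host, stuck_vms):
    # Cleanly stop the VMs that cannot be live-migrated away
    for vm in stuck_vms:
        session.xenapi.VM.clean_shutdown(vm)

    # Evacuate the remaining VMs, update/reboot, migrate them back
    update_host(session, host)

    # Start the stopped VMs again on the freshly updated host
    for vm in stuck_vms:
        session.xenapi.VM.start_on(vm, host, False, False)
```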

Scheduled RPU

See https://github.com/vatesfr/xen-orchestra/issues/1422

RPU reports

Like a backup report on scheduled RPUs

m4xw commented 3 years ago

I'd like to add a scenario that's a mix between agile and non-agile to be considered with this feature:

Scenario

On some small pools (2-3 servers), I use a dedicated NFS VM serving a portion of the storage (note: with the "prevent shutdown" flag set, to prevent human error), shared via the datacenter's VLAN. Each shared storage is then added to the pool.

This is primarily used for disaster recovery via replication, maintenance tasks, low-IO static file serving, and small Docker containers in VMs that also don't serve much.

Circumstances

Unfortunately, I couldn't get a storage server in the same rack, since the setup grew over time. This way migration is still pretty efficient and maintenance is smooth (at this scale), and I consider it a valid use case to evaluate even on fresh setups, even better with a direct 10 Gbit connection between the servers.

Manual Process

  1. Migrate all VMs to another server and their corresponding shared storage.
  2. Detach the to-be-maintained server's shared storage.
  3. Shut down the storage VM from inside (a shutdown signal would be sufficient, overriding the "prevent shutdown" flag).
  4. Update. Reboot.
  5. Attach the storage after the shared storage VM has autostarted, and migrate back to the local shared storage.
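
A rough sketch of the detach / re-attach half of this process with the XenAPI Python bindings; `session`, `storage_sr` and `storage_vm` are placeholders for the NFS SR and the storage VM described above, and any "prevent shutdown" protection would need to be lifted beforehand:

```python
def detach_storage_and_stop_vm(session, storage_sr, storage_vm):
    # Unplug every PBD of the shared SR so no host keeps it attached
    for pbd in session.xenapi.SR.get_PBDs(storage_sr):
        if session.xenapi.PBD.get_currently_attached(pbd):
            session.xenapi.PBD.unplug(pbd)
    # Ask the storage VM for a clean shutdown (guest tools required)
    session.xenapi.VM.clean_shutdown(storage_vm)

def reattach_storage(session, storage_sr):
    # After the host is back and the storage VM has autostarted,
    # re-plug the PBDs so the SR is usable again
    for pbd in session.xenapi.SR.get_PBDs(storage_sr):
        if not session.xenapi.PBD.get_currently_attached(pbd):
            session.xenapi.PBD.plug(pbd)
```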

Potential Issues for the suggested rolling pool update process

If the migration doesn't move the VHDs to another shared storage, it will blow up. If it doesn't properly detach the shared storage, it might blow up. If it doesn't shut down the storage VM because of the "prevent shutdown" flag, it will block the process, with unpredictable results. If it tries to migrate the shared storage VM, it probably won't have enough space, or it will cause tons of delays. If it targets off-site shared storages for backup, it will cause delays and abysmal IOPS while running.

Suggested Considerations/Solutions

Assign a server to a shared storage and a corresponding VM, and validate that the VMs currently running on the to-be-evacuated server aren't on its storage. If any are, migrate them to another flagged shared storage in the pool with the appropriate space. If there is a running VM that can't be migrated due to its size on a local disk, show a question dialog asking whether to pause it or send a shutdown signal. Consider the behaviour of the "prevent shutdown" flag again here.

Now detach the assigned shared storage and, if successful, send a shutdown request to the VM that overrides the "prevent shutdown" flag (or maybe introduce another flag). Also implement a flag to blacklist off-site storage from this process.

It can now do the rolling pool update and, as suggested already, start the shared storage VM again (I have autostart set as well, so maybe automating that setting is an option). A re-attach then needs to be triggered, since the attach on host start failed, and the VMs can be migrated back to the original server and their corresponding local shared storage.

What if we don't want to support this? Implement a fail-safe that indicates it in the user interface and aborts the process.

Misc

Recipes could be supplied to automate such a setup and provide an alternative budget-friendly way that works well with growing systems, as well as for migrating from non-agile legacy pools, since resizing the storage server VHDs is trivial. But let's not drift too far into other features...

Honestly, I am not sure whether a separate issue for all the things that need to be considered is appropriate, but this way the feature can be pretty bulletproof, even on a budget. And there are lots of people out there with a similar setup, probably for wildly different reasons.