migtools / mig-controller

Handle VM state changes during storage live migration #1394

Closed awels closed 4 days ago

awels commented 1 week ago

The state of a VM can change during its storage live migration: the VM can be started or stopped. When a running VM is stopped, the live migration is cancelled and an offline migration is started instead.

The reverse is also true: if a VM is started during an offline migration, the offline migration is cancelled and a live migration is started instead.

Offline migrations use a single rsync server with potentially multiple clients, so if we stop a VM while another offline migration is running, that migration is allowed to complete before a new offline migration starts. The reverse is also true: if a VM is started while it is part of an offline migration, the offline migration is allowed to complete before a live migration starts.
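A minimal Go sketch of these switching rules, assuming hypothetical names (`VMState`, `MigrationKind`, `handleStateChange` are illustrative, not mig-controller's actual API):

```go
// Hypothetical sketch of the switching rules described above. The type
// and function names are illustrative only, not mig-controller's API.
package migration

type VMState int

const (
	VMOff VMState = iota
	VMRunning
)

type MigrationKind int

const (
	OfflineRsync MigrationKind = iota
	StorageLive
)

// desiredMigration maps a VM's current state to the migration kind
// that should be used for it: live for a running VM, rsync otherwise.
func desiredMigration(state VMState) MigrationKind {
	if state == VMRunning {
		return StorageLive
	}
	return OfflineRsync
}

// handleStateChange decides what to do when a VM's state changes
// mid-migration. otherOfflineRunning reports whether another VM's
// transfer is still in flight on the shared rsync server, in which
// case the switch is deferred until that transfer completes.
func handleStateChange(current MigrationKind, newState VMState, otherOfflineRunning bool) string {
	want := desiredMigration(newState)
	if want == current {
		return "continue current migration"
	}
	if otherOfflineRunning {
		// The shared rsync server is still transferring another VM's
		// disks; let it finish before tearing anything down.
		return "wait for running offline migration to complete, then switch"
	}
	if want == StorageLive {
		return "cancel offline migration, start live migration"
	}
	return "cancel live migration, start offline migration"
}
```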

The following combinations of VM state and state changes should behave as described in this table:

| VM state when cutting over | VM state change | Expected behavior |
| --- | --- | --- |
| One VM, off | VM is started after both the rsync server and rsync client are created (pending or running) | The rsync server and client are stopped, and the live migration completes |
| One VM, on | VM is stopped after cutover has started and the second virt-launcher is running | Both virt-launcher pods are stopped; only after they are gone are the rsync server and client created |
| One VM, on | VM is stopped after cutover has started and the second virt-launcher is not running | The virt-launcher pods are stopped; only after they are gone are the rsync server and client created |
| Two VMs, both off | One VM is started after the rsync server and both clients are created (pending or running) | The client associated with the started VM is stopped immediately. The other client is allowed to complete; only then does the live migration of the started VM begin. This prevents anything in the still-running rsync server from interfering with the live migration |
| Two VMs, both off | Both VMs are started after the rsync server and both clients are created (pending or running) | Both clients and the rsync server are stopped; once all rsync pods have stopped, the live migrations start |
| Two VMs, both running | One VM is stopped after the live migrations have started | The running live migration completes; an rsync server and client are then created for the stopped VM and run to completion |
| Two VMs, both running | Both VMs are stopped after the live migrations have started | All virt-launcher pods are stopped; an rsync server and two clients are created, and the offline migration runs to completion. (This scenario can create an rsync server and client, and then do so again for the second VM's disk after the first completes; this happens if the rsync server starts before the second VM is stopped) |
| Two VMs, one running, one stopped | The stopped VM is started after both the live migration and the offline migration have started | The rsync server and client are stopped, a new live migration is created, and both live migrations run to completion |
| Two VMs, one running, one stopped | The running VM is stopped after both the live migration and the offline migration have started | The virt-launcher pods are stopped, and the existing rsync server and client run to completion; after they complete, a new rsync server and client are created for the newly stopped VM and run to completion |
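As a usage sketch, a testable example (it would live in a `_test.go` file next to the sketch above) that walks two of the table's rows through the hypothetical `handleStateChange`; it deliberately ignores the per-client teardown details:

```go
package migration

import "fmt"

// Example checks two rows of the table above against the hypothetical
// handleStateChange sketch.
func Example() {
	// One VM, off: started after the rsync server and client exist and
	// nothing else is transferring, so switch to a live migration now.
	fmt.Println(handleStateChange(OfflineRsync, VMRunning, false))

	// Two VMs, both off: one is started while the other's rsync client
	// is still transferring, so the switch waits for that client.
	fmt.Println(handleStateChange(OfflineRsync, VMRunning, true))

	// Output:
	// cancel offline migration, start live migration
	// wait for running offline migration to complete, then switch
}
```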