Closed: alicefr closed this pull request 8 months ago
The PR https://github.com/kubevirt/kubevirt/pull/10536 shows a PoC implementation of the proposal.
I updated the design according to the latest feedback and renamed the files to storage-migration.
IMO, the main point to clarify right now is what @awels is suggesting here. It is, of course, possible, and it also simplifies the logic; however, it is less flexible.
One of the suggestions I initially had from @fabiand was that users should simply specify the original and destination storage classes. If they are forced to specify each volume, we lose this use case.
@mhenriks many thanks for the reviews and valuable comments! We should also consider the offline case, and the idea of having a global object for the storage migration (SM) is interesting. However, this considerably complicates the design. I hope you can help me sort things out.
I tried to summarize the problems I see:
hi, I have updated this design proposal. The major modification is that I removed the part covering storage class migration. Handling it in KubeVirt would make this feature too large and complex. We should only provide some basic APIs in KubeVirt; another tool can then take care of smart migration scheduling and of managing the PVC life-cycle.
Please, let me know what you think!
how to support StorageMigration without changing vmi nodeName ?
@wavezhang could you please expand further on your question?
The storage migration triggers a VM migration under the hood. The scheduling of the target virt-launcher pod also takes the destination PVC into account. Does this answer your question?
In some cases, users want to live migrate a volume without changing the node of the virt-launcher pod
I think this should be the preferable option in general. Right now, we put anti-affinity between the source and target virt-launcher pods to get the second pod scheduled on another node. However, I plan to instead add affinity between the two pods when we trigger a volume migration; this way the copy should be faster. I haven't implemented it yet, but it is on my todo list.
However, I would rather avoid an explicit node selector. Does this approach sound reasonable to you?
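For illustration, a minimal sketch of the kind of pod affinity term a migration controller could set on the target virt-launcher pod so that it lands on the same node as the source pod; the label key/value used in the selector are an assumption for the example, not necessarily the labels KubeVirt uses:
# Hypothetical affinity stanza for the target virt-launcher pod.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          kubevirt.io/created-by: "<source VMI UID>"   # assumed label identifying the source pod
      topologyKey: kubernetes.io/hostname              # co-locate on the same node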
That's nice.
In the current virt-handler design, the assumption is that a VMI corresponds to only one virt-launcher pod on a node. Are there any details on how to support two virt-launcher pods for one VMI on the same node?
I plan to add this as a follow-up; it isn't covered by this design proposal
@fabiand @mhenriks @awels @vasiliy-ul would you mind taking another look at this proposal? I had several offline discussions and I'd like to have your opinion on the current state. The proposal has been simplified and it only takes care of migrating an explicit list of volumes.
@awels @vasiliy-ul I tried to write a complex example that fully shows the volume migration status. I hope this helps with the review
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm1
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
          - disk:
              bus: virtio
            name: datavolumedisk2
      volumes:
      - dataVolume:
          name: src-pvc1
        name: datavolumedisk1
      - dataVolume:
          name: src-pvc2
        name: datavolumedisk2
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm2
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk3
      volumes:
      - dataVolume:
          name: src-pvc3
        name: datavolumedisk3
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm3
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            shareable: true
            name: datavolumedisk4
          - disk:
              bus: virtio
            name: datavolumedisk5
      volumes:
      - dataVolume:
          name: src-pvc4
        name: datavolumedisk4
      - dataVolume:
          name: src-pvc5
        name: datavolumedisk5
---
apiVersion: storage.kubevirt.io/v1alpha1
kind: VolumeMigration
metadata:
  name: vol-mig
spec:
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc2
    destinationPvc: dest-pvc2
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc3
    destinationPvc: dest-pvc3
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc4
    destinationPvc: dest-pvc4
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc5
    destinationPvc: dest-pvc5
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc6
    destinationPvc: dest-pvc6
    reclaimPolicySourcePvc: Delete
status:
  volumeMigrationStates:
  - migratedVolume:
    - sourcePvc: src-pvc1
      destinationPvc: dest-pvc1
      reclaimPolicySourcePvc: Delete
    - sourcePvc: src-pvc2
      destinationPvc: dest-pvc2
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm1
    virtualMachineInstanceName: vm1
    startTimestamp: "2024-01-12T08:54:11Z"
  - migratedVolume:
    - sourcePvc: src-pvc3
      destinationPvc: dest-pvc3
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm2
    virtualMachineInstanceName: vm2
    startTimestamp: "2024-01-12T08:54:11Z"
    endTimestamp: "2024-01-12T09:54:11Z"
    succeeded: true
  rejectedVolumeMigrations:
  - rejectedVolumes:
    - sourcePvc: src-pvc4
      reason: "Shareable disks aren't supported to be migrated"
    virtualMachineInstanceName: vm3
    associatedVolumes:
    - src-pvc5
  pendingVolumes:
  - sourcePvc: src-pvc6
    destinationPvc: dest-pvc6
    reclaimPolicySourcePvc: Delete
  total: 4
  succeeded: 1
  running: 1
  pending: 1
  rejected: 1
In this case, we have three VMs. One migration is running, one has finished, and one has been rejected. The migrated volume src-pvc6 isn't assigned to any VM and is therefore marked as pending.
@alicefr, thanks for providing the example :+1: One thing that quickly comes to my mind here. Some of the PVCs are rejected. Why do we even have other migrations started?
- migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc2
    destinationPvc: dest-pvc2
    reclaimPolicySourcePvc: Delete
  virtualMachineMigrationName: vol-mig-vm1
  virtualMachineInstanceName: vm1
  startTimestamp: "2024-01-12T08:54:11Z"
It's about the conversation https://github.com/kubevirt/community/pull/245#discussion_r1447355694. I thought that if we rejected some of the PVCs, then we should leave everything untouched. Or did I misunderstand something?
I leave untouched the PVCs that belong to the same VMI as a rejected one, but I migrate the valid ones that belong to another VMI.
In the example, src-pvc1/2/3 belong to two VMIs (vm1 and vm2) that have valid migrated volumes, while vm3 has a rejected volume; therefore src-pvc4 and src-pvc5 are left untouched.
The status also gives the users a hint why the volume migration was rejected for vm3: they need to unset shareable for src-pvc4, while src-pvc5 is valid, and that's why it is listed under associatedVolumes.
Okay, then we probably misunderstood each other regarding 'leaving everything untouched' :)
If we can detect in advance that some part of the migration request cannot be fulfilled, do we really need to proceed with the rest? Personally, as a user I would expect a migration request to (ideally) complete successfully as a whole or fail as a whole. Realistically, we cannot guarantee it, since e.g. a migration may fail for only one VM and succeed for the others, so it is best-effort. But when we have rejected volumes, in my opinion, this is sort of a precondition for starting the whole thing. WDYT?
Maybe we could also consider adding a field in the VolumeMigration CR like ignoreRejectedPVCs to allow users to state the intention explicitly?
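For illustration, such a field could look roughly like this on the CR from the example above (the field name is taken from the suggestion; its exact semantics are an assumption, not part of the proposal):
apiVersion: storage.kubevirt.io/v1alpha1
kind: VolumeMigration
metadata:
  name: vol-mig
spec:
  # Hypothetical knob: skip volumes that fail validation instead of
  # refusing to start the whole migration.
  ignoreRejectedPVCs: true
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete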
It is a possibility. I don't have a strong opinion about it. @awels what do you think?
So I think that a higher-level component will probably never generate a VolumeMigration that includes multiple VMs. It would generate one VolumeMigration per VM, which includes all the PVCs that are part of that VM. It will have the knowledge on how to do that. That way the result will be what both @alicefr and @vasiliy-ul expect (the entire VM migration, and thus the VolumeMigration, is rejected if one of the PVCs is rejected).
I think it might even make sense to simply not start a migration if you detect multiple VMs. A weird edge case will be a shared disk, though. I don't know whether dealing with just one VM per VolumeMigration makes it easier for you or not.
I think it might even make sense to simply not start a migration if you detect multiple VMs.
I tend to agree here. By supporting migration of multiple VMs, VolumeMigration takes a lot of 'responsibility' and makes the usage of the API not intuitive. I would also suggest making it simpler and refusing migration of multiple VMs, at least in the first iteration. Potentially this can be added later if there is demand from users of the feature.
I'm not quite sure this is a nice option from the user perspective. They would need to group all the volumes per VMI. I know that we plan to add some additional logic in an overlaying tool to do this automatically, but I cannot assume that in this design. On the other hand, this would simplify the status report.
If we decided to deal with multiple VMs, however, we need to take into consideration that there are possibly some pending volumes that could be triggered at a later time by launching a VMI; they could also be rejected. Another case is when one of the VM migrations fails. The status of a VolumeMigration with multiple VMs will never have a single state.
Ok, this simplifies my life a bit :). I'll update the design and the PoC PR next week
Right, this is my thinking as well. I am working on a tool that takes that responsibility away from this API. For the first iteration, simpler is better IMO.
I updated the proposal and the API with the focus on supporting a single VM migration per volume migration object.
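For illustration, with the single-VM restriction the earlier example would be split into one object per VMI; a sketch for vm1, based on the API shown above (details may differ from the final proposal):
apiVersion: storage.kubevirt.io/v1alpha1
kind: VolumeMigration
metadata:
  name: vol-mig-vm1
spec:
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc2
    destinationPvc: dest-pvc2
    reclaimPolicySourcePvc: Delete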
I have a question about the new controller. This proposal introduces a new controller for the new CRD. However, this controller is also responsible for updating the VMI and VM spec. Is this design reasonable, or does anyone see any issue with it? AFAIU, today every object has its own controller that updates its fields. Triggering the VMI and VM controllers seems more complex to me, but if you see any issues I can also explore this option.
I do not know the historical reasons behind the existing design, but keep in mind that if you modify VMI and VM objects from several controllers, you will need to ensure some sort of synchronization. Otherwise, you may introduce hard-to-debug race conditions.
So this seems reasonable to me:
today, every object has its own controller that updates its fields.
After thinking a bit more about it, I think it is possible.
Currently, I have all the logic in the controller for volume migration, but I can move it to the VMI/VM controllers.
For the VMI, once the MigrationState is successful and there are migrated volumes in the status, the VMI controller can update the VMI spec. For the VM, the VM controller can react when the VMI has updated the volumes in its spec; the VM controller can then reflect the change in the VM too.
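As a purely illustrative sketch of that flow: migrationState is an existing VMI status field, but the migratedVolumes bookkeeping shown below is an assumption, not an agreed API.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm1
status:
  migrationState:
    completed: true            # the migration driving the volume copy finished successfully
  migratedVolumes:             # assumed bookkeeping written by the volume migration controller
  - volumeName: datavolumedisk1
    sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
# On seeing this, the VMI controller would swap src-pvc1 for dest-pvc1 in
# spec.volumes, and the VM controller would then propagate the change to the VM.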
Updated the design proposal with the latest feedback on the API, corrected the typos, and added examples for the pending and rejected volume cases.
@awels many thanks for the review! I have updated the proposal accordingly.
@vasiliy-ul when you have some time, would you mind taking another look at the proposal? I'd like to consolidate the API and update the code accordingly.
I updated the proposal with the latest feedback, set Delete as the default reclaim policy for the source PVC, and updated the examples accordingly.
@awels @vasiliy-ul @mhenriks I have updated the design proposal with the last feedback and discussion. Would you mind taking another look when you have time?
In particular, I added more details about the DV handling in the design section and about the limitations regarding gitops or similar declarative mechanisms.
@fabiand @xpivarc @acardace @vladikr I think this design is reaching a quite stable phase. Could you please also review it and share feedback?
/lgtm
New changes are detected. LGTM label has been removed.
Thanks @alicefr this looks pretty neat!
I wonder if we really need a new CRD to manage this. What do you think about doing this transparently when you edit the VM spec, now that we have the rollout strategy mechanism merged? If you were to edit the spec with an incompatible volume you'd just set the RestartRequired condition and wait for a restart; otherwise you'd migrate the VM, swapping the old volume with the new one. What do you think?
Many thanks for the review.
I don't know if what you are suggesting is possible.
First, it is only applicable to VMs (this might still be acceptable).
Second, what if the volume migration fails? You have modified the VM spec, but you should roll back to the previous values. Additionally, the migration can take very long. What should we do in this case, apply the update of the spec only if the migration is successful after a long period, like 2 hours?
Third, the information in the status of the new CRD would have to be added to the VM status. This might be quite a lot of information to add there.
The last issue I can currently see with this approach is that you cannot specify additional fields like the sourceReclaimPolicy. Right now we don't have many, so we might decide on a single behavior, but I think we could need additional options in the future. Do you see any workarounds to these points?
UPDATE: On the other side, the big advantage of this approach is that it would be compatible with gitops. @awels @mhenriks what do you think?
One additional thing I need to change is to replace the Phase with Conditions. Apparently, phases are deprecated in Kubernetes (see the documentation about the API conventions).
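A minimal sketch of what a conditions-based status could look like on the CR; the condition types and reasons here are assumptions for illustration, not the proposed API:
status:
  conditions:
  - type: Ready              # assumed condition type
    status: "False"
    reason: MigrationInProgress
    lastTransitionTime: "2024-01-12T08:54:11Z"
  - type: Succeeded          # assumed condition type
    status: "Unknown"
    reason: MigrationInProgress
    lastTransitionTime: "2024-01-12T08:54:11Z"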
@acardace What if we keep the new CRD but, as a second step, add the option that an update of the VM spec can generate a volume migration object? The storage migration is complex enough to deserve its own API and CRD. WDYT?
First, it is only applicable to VMs (this might still be acceptable).
That's how rollout strategy works, so I don't see an issue there.
Second, what if the volume migration fails? You have modified the VM spec, but you should roll back to the previous values. Additionally, the migration can take very long. What should we do in this case, apply the update of the spec only if the migration is successful after a long period, like 2 hours?
We just keep retrying to migrate; we still aim for eventual consistency, and we should not roll anything back.
The change would be triggered by a patch in the VM spec, so this is a good question! For CPU/memory hotplug we publish a condition that the hotplug is in progress and later remove it when the migration has been completed.
Third, the information in the status of the new CRD would have to be added to the VM status. This might be quite a lot of information to add there.
I'd not consider this a blocker; CPU/memory hotplug both add the info they need to the status.
The last issue I can currently see with this approach is that you cannot specify additional fields like the sourceReclaimPolicy. Right now we don't have many, so we might decide on a single behavior, but I think we could need additional options in the future. Do you see any workarounds to these points?
You could decide to have this setting only at the cluster level; otherwise, you'd have to add this setting to the current Volume spec to have it per-VM, but the direction lately is to keep things simple with sane defaults.
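For illustration, a cluster-level default could be expressed roughly like this in the KubeVirt CR configuration; the volumeMigration and sourceReclaimPolicy fields are hypothetical, not an existing API:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # Hypothetical cluster-wide default for how source PVCs are handled
    # after a successful volume migration.
    volumeMigration:
      sourceReclaimPolicy: Delete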
UPDATE: On the other side, the big advantage of this approach is that it would be compatible with gitops. @awels @mhenriks what do you think?
yes, indeed.
@alicefr @acardace A couple of notes on controlling migration via the VM.
First, a cluster admin looking to migrate from storage class A to storage class B would likely prefer an API that is VM agnostic, since they really only know/care about PVCs. Mapping PVCs to VMs is an extra step. But I don't think the intention is that most users will invoke this API directly; rather, there will be some other operator/controller/CRD managing a bulk migration. So I think a VM-based migration API is fine.
Now the question is how we can have a storage migration API that is gitops friendly. I don't think this is as straightforward as with mem/cpu hotplug. Migration is a two-step process: 1) migrate to the new volume, 2) remove the old volume, and in order for migration to be compatible with gitops, both of those steps have to be driven by the user. I think we'd have to do something like this:
VirtualMachine at t0
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: vm-root
[...]
t1 - Specify migration volume
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
          - disk:
              bus: virtio
            name: migrationDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: vm-root
      - name: migrationDisk
        volumeMigration:
          sourceClaim: vm-root
          targetClaim: vm-root-mig
[...]
t3 - migration completes
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
status:
  volumeMigrationStatuses:
  - name: migrationDisk
    completed: true
    # other stuff
t4 - remove old volume
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: vm-root-mig
[...]
I do think what @acardace is suggesting is gitops friendly. For example, the original VM specification is:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: src-pvc
[...]
and then the user changes to:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: dst-pvc
[...]
This update should trigger the volume migration under the hood.
In the case of datavolumes, the user should replace them in the volumes section with PVCs AND remove the data volume template.
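For illustration, a sketch of that change for a DataVolume-backed VM; the names dv-src and dst-pvc are made up for the example, and the DV source/storage details are omitted:
# Before: DataVolume-backed VM (abridged)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  dataVolumeTemplates:
  - metadata:
      name: dv-src
    spec: {}            # DV source/storage details omitted in this sketch
  template:
    spec:
      volumes:
      - name: rootDisk
        dataVolume:
          name: dv-src
---
# After: the dataVolumeTemplates entry is removed and the volume points
# directly to the destination PVC, which triggers the volume migration.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: dst-pvc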
Now, the missing piece here is handling the source volumes after the migration. In the current proposal, this is defined by the SourceReclaimPolicy, and it would be missing from the VM update. An option would be to further simplify this and delegate the source volume removal to the overlaying tool. @awels how does this sound to you?
I personally prefer to avoid putting any knowledge of volume migration into the VM specification.
The other question I have is whether we should keep the VolumeMigration CRD or completely remove it. Keeping it would lead to two paths for migrating the storage. Initially, we could avoid introducing the CRD and eventually add it in a second round.
How does this sound to you?
Now, the missing piece here is handling the source volumes after the migration. In the current proposal, this is defined by the SourceReclaimPolicy, and it would be missing from the VM update. An option would be to further simplify this and delegate the source volume removal to the overlaying tool. @awels how does this sound to you?
My tool already has to know this information because it would have to set the field in the CRD. So it is not that much more work to simply do the work instead of delegating it.
I personally prefer to avoid putting any knowledge of volume migration into the VM specification.
Agreed
This update should trigger the volume migration under the hood.
@alicefr I don't see how that scheme would work if the VM has multiple disks. Also what if someone wants to replace (not migrate) a disk with a different already existing/populated disk?
For multiple disks, I think the user would need to patch the spec with multiple volumes at the same time.
As for replacing, I think we need to define what "replace" means. First, the volume migration would be triggered only if the VM is running. In order to replace an existing disk on a running VM, this would imply unplugging the old disk and hotplugging the new one. I agree this will be a hard and subtle difference to define. I think replacing a disk should be two different updates, while a volume migration should be done in a single request.
/cc @acardace @xpivarc this is what we have discussed this morning about hotplug
For multiple disks, I think the user would need to patch the spec with multiple volumes at the same time.
I guess we can just assume the same order as before, but I typically prefer explicit to implicit.
As for replacing, I think we need to define what "replace" means. First, the volume migration would be triggered only if the VM is running. In order to replace an existing disk on a running VM, this would imply unplugging the old disk and hotplugging the new one. I agree this will be a hard and subtle difference to define. I think replacing a disk should be two different updates, while a volume migration should be done in a single request.
My point is mostly about the VM-stopped case. I don't think a config change should do two different things based on whether the VM is running or not. Someone could have their own scheme for backing up/restoring VM disks and just want to switch to an older version of the disk from backup.
hi,
There have been multiple discussions in parallel, and the current situation is pretty chaotic, so I'd like to provide an update on the current challenges and options.
The primary concern with the current architecture is the order in which the VM specification is updated with the destination volumes. In the current design, this occurs as the final step when the migration is successful. This appears to be an anti-pattern because it is not gitops friendly, and it also automatically modifies the VM specification without the user's explicit action.
An alternative behavior was suggested: trigger the volume migration when the user explicitly updates the volumes in the VM specification. Unfortunately, this method may be confusing because it isn't easy to tell whether the user wished to replace an old disk with a new one, for example by unplugging and then hot-plugging or by restarting the VM, or to make a complete copy of the disk with a volume migration.
Ideally, the desired outcome of this discussion should be a clean approach to trigger the volume migration when the volumes in the VM specification are updated, like changing the claimName from:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: src-pvc
[...]
to:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootDisk
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: dst-pvc
[...]
I don't see a simple solution to this problem; however, several possibilities were suggested:
Both approaches are not optimal, and I see the following disadvantages:
Between the two current solutions, I would prefer the second option. Does anyone have better ideas? /cc @vladikr @acardace @mhenriks @awels @xpivarc
Yeah, thinking about it a bit more, option 1 is too racy and one would have to define a specific order, which is probably not what we want. Not sure I like option 2 either. What happens if we create our CR, the VMI migrates and the storage there is changed, but then the user never updates the VM? We now have an inconsistent state. If the user shuts down the VM and starts it again, it would be back to the original state. We could at least track in the CR whether the VM has been updated by the user or not. Give some indication that there is something left to do?
Yeah, with option 2 the VM is in a very precarious state for the time between when the migration completes and when the spec is updated. To address this, the VM controller would have to be made aware of the volume migration CR, and then we are back to option 1.
@awels @mhenriks, what I'm hearing here is that neither option 1 nor option 2 is viable, am I correct?
Not sure I'm ready to give up on option 2 yet. Maybe we can leverage the VM status to make it viable? Perhaps by adding a storageMigrationInProgress field in the VM status that refers to the name of the migration CR. When the migration starts, the value is set and remains set until the user updates the VM spec to match the migration. So the storage migration controller is watching the VM for spec changes, and there is a finalizer on the migration until the VM spec is updated.
There is already a similar field called restoreInProgress for VM restores.
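For illustration, a sketch of what that could look like in the VM status; storageMigrationInProgress is the hypothetical field suggested above, not an existing API:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm1
status:
  # Hypothetical field, analogous to restoreInProgress: set when the migration
  # starts and cleared only once the user updates spec.template.spec.volumes
  # to match the migration target.
  storageMigrationInProgress: vol-mig-vm1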
From my point of view, modifying the VMI spec is an anti-pattern as well. I would prefer to avoid it.
I'm still puzzled about what is too racy with the first option. If I understood correctly, disk replacements are only done via restarts. Hence, if the VM is running and the VM spec is updated, we can assume that this is a disk replacement and do nothing until the VM restarts. During that time a VolumeMigration object may appear, and then we act. If the object is already there at the time of the VM change, we act immediately.
@vladikr I think the problem arises, for example, when the VM is running and the volume migration object is created later than the VM spec update. I tried to sketch the two possible cases.
In case 1, when the server processes the VM update it can see the volume migration object and can interpret the update as a storage migration.
However, the case that concerns me is the second one: at the moment of the VM spec update the volume migration object isn't yet visible to the controller. In this case, the update is interpreted as an unplug of the old volume and a hotplug of the new one, and the second PVC stays empty even though the data should have been copied.
I think there are other cases like this one.
This proposal is a follow-up to https://github.com/kubevirt/community/pull/240 and only targets moving the storage to another PVC. VM live migration with local storage is out of scope.
In comparison to the previous proposal, this one introduces a new CRD for representing the volume migration.