
design-proposal: volume migration #245

Closed alicefr closed 8 months ago

alicefr commented 1 year ago

This proposal is a follow-up to https://github.com/kubevirt/community/pull/240 and only targets moving the storage to another PVC. VM live migration with local storage is out of scope.

In comparison to the previous proposal, this introduces a new CRD for representing the Volume Migration.

alicefr commented 1 year ago

https://github.com/kubevirt/kubevirt/pull/10536 shows a PoC implementation of the proposal.

alicefr commented 1 year ago

I updated the design according to the latest feedback and renamed the files to storage-migration. IMO, the main point to clarify right now is what @awels is suggesting here. It is, of course, possible and it also simplifies the logic; however, it is less flexible. One of the suggestions I initially had from @fabiand was that users should simply specify the original and destination storage classes. If they are forced to specify each volume, then we lose this usage.
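
For illustration, the two API shapes under discussion could look roughly like the following; both snippets are only sketches, and the storage-class-based field names are made up:

# Storage-class based request (sketch of the usage @fabiand suggested; field names are hypothetical)
spec:
  sourceStorageClass: standard
  destinationStorageClass: fast-ssd

# Per-volume request (sketch of what @awels is suggesting)
spec:
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1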

alicefr commented 1 year ago

@mhenriks many thanks for the reviews and valuable comments! We should also consider the offline case, and the idea of having a global object for the SM (storage migration) is interesting. However, it considerably complicates the design. I hope you can help me sort things out.

I tried to summarize the problems I see:

  1. This feature will be part of KubeVirt core, so we shouldn't assume that CDI is installed
  2. The Storage Migration CRD should be the same for online and offline Storage Migration; however, they translate to 2 completely different paths. IMO, we should introduce them incrementally. I'm not sure which should come first.
  3. Offline Storage Migration:
    1. What do we do if a VM is turned on during the migration? We could delay the boot until the SM is completed or abort the SM.
    2. Do we rely on CSI clone? What if the SC doesn’t support it?
    3. What if the PVCs belong to a storage class with WaitForFirstConsumer (WFFC) binding? Then the clone is performed only when there is a pod using the PVC. This complicates the reclaim process for the source PVC. Do we want to force the binding or wait until there is a first pod using it?
  4. Online storage migration:
    1. Complexity introduced by the management of multiple VMs
      1. How do we migrate the VMs, in parallel or in sequence? The SM is very resource-consuming, hence we probably don’t want to trigger all the SMs simultaneously. Do we need to introduce a StorageMigrationSequenceMode with Serial and Parallel (see the sketch after this list)?
      2. What do we do if one SM fails and the others are successful? I don’t think we can roll back to the previous state. This would consume more resources and might lead to an even messier state.
      3. Similarly to the previous point, what if the SM object is deleted during the migrations? What do we do?
    2. What do we do for online SM if the VM is turned off or the virt-launcher pod is deleted? Do we abort the SM or do we want to force it to complete? For executing the SM, we need the virt-launcher pod up and running.
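
As referenced above, a rough sketch of how that sequencing option could look; the kind, group/version and field name are all hypothetical at this stage:

apiVersion: storage.kubevirt.io/v1alpha1  # assumed group/version
kind: StorageMigration
metadata:
  name: sm-example
spec:
  # hypothetical StorageMigrationSequenceMode field from the question above
  sequenceMode: Serial  # or Parallel
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
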
alicefr commented 11 months ago

Hi, I have updated this design proposal. The major modification is that I removed the part that covers the storage class migration. I think handling it in KubeVirt makes this feature too huge and complex. We should only provide some basic APIs in KubeVirt, and then another tool can do smart migration scheduling and manage the PVC life-cycle.

Please, let me know what you think!

wavezhang commented 10 months ago

How to support StorageMigration without changing the VMI nodeName?

alicefr commented 10 months ago

How to support StorageMigration without changing the VMI nodeName?

@wavezhang could you please expand further on your question?

The storage migration triggers a VM migration under the hood. The scheduling of the target virt-launcher pod also takes the destination PVC into account. Does this answer your question?

wavezhang commented 10 months ago

How to support StorageMigration without changing the VMI nodeName?

@wavezhang could you please expand further on your question?

The storage migration triggers a VM migration under the hood. The scheduling of the target virt-launcher pod also takes the destination PVC into account. Does this answer your question?

In some cases, users want to live migrate a volume without changing the node of the virt-launcher pod.

alicefr commented 10 months ago

How to support StorageMigration without changing the VMI nodeName?

@wavezhang could you please expand further on your question? The storage migration triggers a VM migration under the hood. The scheduling of the target virt-launcher pod also takes the destination PVC into account. Does this answer your question?

In some cases, users want to live migrate a volume without changing the node of the virt-launcher pod.

I think this should be the preferable option in general. Right now, we are putting anti-affinity between the source and target virt-launcher pods to get the second pod scheduled on another node. However, I plan to extend this and put an affinity between the 2 pods when we trigger the volume migration. In this way, the copy should be faster. I haven't implemented it yet, but it is on my todo list.

But I would rather avoid having an explicit node selector. Does this approach sound reasonable to you?
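
To make the idea concrete, a rough sketch (not the actual implementation) of the kind of pod affinity the target virt-launcher pod could carry during a volume migration; the label used to identify the source pod is an assumption:

# Fragment of a hypothetical target virt-launcher pod spec during volume migration
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          vm.kubevirt.io/name: vm1  # assumed label identifying the source virt-launcher pod
      topologyKey: kubernetes.io/hostname  # co-locate source and target on the same node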

wavezhang commented 10 months ago

I think this should be the preferable option in general. Right now, we are putting anti-affinity between the source and target virt-launcher pods to get the second pod scheduled on another node. However, I plan to extend this and put an affinity between the 2 pods when we trigger the volume migration. In this way, the copy should be faster. I haven't implemented it yet, but it is on my todo list.

That's nice.

In the current virt-handler design, the assumption is that a VMI corresponds to only one virt-launcher pod on a node. Are there any details on how to support multiple/two virt-launcher pods for one VMI on the same node?

alicefr commented 10 months ago

I think this should be the preferable option in general. Right now, we are putting anti-affinity between the source and target virt-launcher pods to get the second pod scheduled on another node. However, I plan to extend this and put an affinity between the 2 pods when we trigger the volume migration. In this way, the copy should be faster. I haven't implemented it yet, but it is on my todo list.

That's nice.

In the current virt-handler design, the assumption is that a VMI corresponds to only one virt-launcher pod on a node. Are there any details on how to support multiple/two virt-launcher pods for one VMI on the same node?

I plan to add this as a follow-up; it isn't covered by this design proposal.

alicefr commented 10 months ago

@fabiand @mhenriks @awels @vasiliy-ul would you mind taking another look at this proposal? I had several offline discussions and I'd like to have your opinion on the current state. The proposal has been simplified and it only takes care of migrating an explicit list of volumes.

alicefr commented 10 months ago

@awels @vasiliy-ul I tried to write a complex example that fully shows the volume migration status. I hope this helps with the review.

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm1
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
          - disk:
              bus: virtio
            name: datavolumedisk2
      volumes:
      - dataVolume:
          name: src-pvc1
        name: datavolumedisk1
      - dataVolume:
          name: src-pvc2
        name: datavolumedisk2
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm2
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk3
      volumes:
      - dataVolume:
          name: src-pvc3
        name: datavolumedisk3
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm3
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            shareable: true
            name: datavolumedisk4
          - disk:
              bus: virtio
            name: datavolumedisk5
      volumes:
      - dataVolume:
          name: src-pvc4
        name: datavolumedisk4
      - dataVolume:
          name: src-pvc5
        name: datavolumedisk5
---
apiVersion: storage.kubevirt.io/v1alpha1
kind: VolumeMigration
metadata:
  name: vol-mig
spec:
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc2
    destinationPvc: dest-pvc2
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc3
    destinationPvc: dest-pvc3
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc4
    destinationPvc: dest-pvc4
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc5
    destinationPvc: dest-pvc5
    reclaimPolicySourcePvc: Delete
  - sourcePvc: src-pvc6
    destinationPvc: dest-pvc6
    reclaimPolicySourcePvc: Delete
status:
  volumeMigrationStates:
  - migratedVolume:
    - sourcePvc: src-pvc1
      destinationPvc: dest-pvc1
      reclaimPolicySourcePvc: Delete
    - sourcePvc: src-pvc2
      destinationPvc: dest-pvc2
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm1
    virtualMachineInstanceName: vm1
    startTimestamp: "2024-01-12T08:54:11Z"
  - migratedVolume:
    - sourcePvc: src-pvc3
      destinationPvc: dest-pvc3
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm2
    virtualMachineInstanceName: vm2
    startTimestamp: "2024-01-12T08:54:11Z"
    endTimestamp: "2024-01-12T09:54:11Z"
    succeeded: true
  rejectedVolumeMigrations:
  - rejectedVolumes:
    - sourcePvc: src-pvc4
      reason: "Shareable disks aren't supported for migration"
    virtualMachineInstanceName: vm3
    associatedVolumes:
    - src-pvc5
  pendingVolumes:
  - sourcePvc: src-pvc6
    destinationPvc: dest-pvc6
    reclaimPolicySourcePvc: Delete
  total: 4
  succeeded: 1
  running: 1
  pending: 1
  rejected: 1

In this case, we have 3 VMs. One migration is running, one has finished and one has been rejected. The migrated volume src-pvc6 isn't assigned to any VM, so it is marked as pending.

vasiliy-ul commented 10 months ago

@alicefr, thanks for providing the example :+1: One thing that quickly comes to my mind here. Some of the PVCs are rejected. Why do we even have other migrations started?

  - migratedVolume:
    - sourcePvc: src-pvc1
      destinationPvc: dest-pvc1
      reclaimPolicySourcePvc: Delete
    - sourcePvc: src-pvc2
      destinationPvc: dest-pvc2
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm1
    virtualMachineInstanceName: vm1
    startTimestamp: "2024-01-12T08:54:11Z"

It's about the conversation https://github.com/kubevirt/community/pull/245#discussion_r1447355694. I thought that if we rejected some of the PVCs then we should leave everything untouched. Or did I misunderstand something?

alicefr commented 10 months ago

@alicefr, thanks for providing the example 👍 One thing that quickly comes to my mind here. Some of the PVCs are rejected. Why do we even have other migrations started?

  - migratedVolume:
    - sourcePvc: src-pvc1
      destinationPvc: dest-pvc1
      reclaimPolicySourcePvc: Delete
    - sourcePvc: src-pvc2
      destinationPvc: dest-pvc2
      reclaimPolicySourcePvc: Delete
    virtualMachineMigrationName: vol-mig-vm1
    virtualMachineInstanceName: vm1
    startTimestamp: "2024-01-12T08:54:11Z"

It's about the conversation #245 (comment). I thought that if we rejected some of the PVCs then we should leave everything untouched. Or did I misunderstand something?

I leave untouched the PVCs that belong to the same VMI as a rejected one, but I migrate the valid ones that belong to other VMIs.

In the example, src-pvc1/2/3 belong to 2 VMIs (vm1 and vm2) that have only valid migrated volumes, while vm3 has a rejected volume; therefore src-pvc4 and src-pvc5 are left untouched.

The status also gives the users a hint why the volume migration was rejected for vm3. They need to unset shareable for src-pvc4, while src-pvc5 is valid, and that's why it is listed under associatedVolumes.

vasiliy-ul commented 10 months ago

Okay, then we probably misunderstood each other regarding 'leaving everything untouched' :)

If we can detect in advance that some part of the migration request cannot be fulfilled, do we really need to proceed with the rest? I mean, personally, I as a user would expect that a migration request should (ideally) complete successfully as a whole or fail as a whole. Well, realistically, we cannot guarantee it since e.g. a migration may fail for only one VM and succeed for the others. So it is a best-effort. But when we have rejected volumes, in my opinion, this is sort of a precondition for starting the whole thing. WDYT?

Maybe we could also consider adding a field in VolumeMigration CR like ignoreRejectedPVCs to allow users to state the intention explicitly?
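
For illustration, such a flag could sit directly in the VolumeMigration spec; the field name and default below are only the suggestion above, nothing that has been agreed on:

apiVersion: storage.kubevirt.io/v1alpha1
kind: VolumeMigration
metadata:
  name: vol-mig
spec:
  ignoreRejectedPVCs: true  # hypothetical flag: proceed with the valid volumes even if some are rejected
  migratedVolume:
  - sourcePvc: src-pvc1
    destinationPvc: dest-pvc1
    reclaimPolicySourcePvc: Delete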

alicefr commented 10 months ago

Okay, then we probably misunderstood each other regarding 'leaving everything untouched' :)

If we can detect in advance that some part of the migration request cannot be fulfilled, do we really need to proceed with the rest? I mean, personally, I as a user would expect that a migration request should (ideally) complete successfully as a whole or fail as a whole. Well, realistically, we cannot guarantee it since e.g. a migration may fail for only one VM and succeed for the others. So it is a best-effort. But when we have rejected volumes, in my opinion, this is sort of a precondition for starting the whole thing. WDYT?

Maybe we could also consider adding a field in VolumeMigration CR like ignoreRejectedPVCs to allow users to state the intention explicitly?

It is a possibility. I don't have a strong opinion about it. @awels what do you think?

awels commented 10 months ago

So I think that a higher level component will probably never generate a VolumeMigration that includes multiple VMs. It would generate one VolumeMigration per VM that includes all the PVCs that are part of that VM. It will have the knowledge on how to do that. That way the result will be what both @alicefr and @vasiliy-ul expect (the whole VolumeMigration for a VM is rejected if one of its PVCs is rejected).

I think it might even make sense to simply not start a migration if you detect multiple VMs. Some weird edge case will be a shared disk though. I don't know if dealing with just one VM per VolumeMigration makes it easier for you or not.

vasiliy-ul commented 10 months ago

I think it might even make sense to simply not start a migration if you detect multiple VMs.

I tend to agree here. By supporting migration of multiple VMs, VolumeMigration takes a lot of 'responsibility' and makes the usage of the API not intuitive. I would also suggest making it simpler and refuse migration of multiple VMs. At least at the first iteration. Potentially this can be added later if there is a demand from the users of the feature.

alicefr commented 10 months ago

So I think that a higher level component will probably never generate a VolumeMigration that includes multiple VMs. It would generate one VolumeMigration per VM that includes all the PVCs that are part of that VM. It will have the knowledge on how to do that. That way the result will be what both @alicefr and @vasiliy-ul expect (the whole VolumeMigration for a VM is rejected if one of its PVCs is rejected).

I think it might even make sense to simply not start a migration if you detect multiple VMs. Some weird edge case will be a shared disk though. I don't know if dealing with just one VM per VolumeMigration makes it easier for you or not.

I'm not quite sure this is a nice option from the user perspective. They would need to group all the volumes per VMI. I know that we plan to add some additional logic in an overlaying tool to do this automatically, but I cannot assume it in this design. On the other hand, this will simplify the status report.

If we decide to deal with multiple VMs, however, we need to take into consideration that there may be pending volumes whose migration could be triggered at a later time by launching a VMI. They could also be rejected. Another case is when one of the VM migrations fails. The status of a VolumeMigration with multiple VMs will never have a single state.

alicefr commented 10 months ago

I think it might even make sense to simply not start a migration if you detect multiple VMs.

I tend to agree here. By supporting migration of multiple VMs, VolumeMigration takes a lot of 'responsibility' and makes the usage of the API not intuitive. I would also suggest making it simpler and refuse migration of multiple VMs. At least at the first iteration. Potentially this can be added later if there is a demand from the users of the feature.

Ok, this simplifies my life a bit :). I'll update the design and the PoC PR next week.

awels commented 10 months ago

I think it might even make sense to simply not start a migration if you detect multiple VMs.

I tend to agree here. By supporting migration of multiple VMs, VolumeMigration takes a lot of 'responsibility' and makes the usage of the API not intuitive. I would also suggest making it simpler and refuse migration of multiple VMs. At least at the first iteration. Potentially this can be added later if there is a demand from the users of the feature.

Right, this is my thinking as well. I am working on a tool that takes that responsibility away from this API. For the first iteration, simpler is better IMO.

alicefr commented 10 months ago

I updated the proposal and the API with the focus on supporting a single VM migration per volume migration object.

alicefr commented 10 months ago

I have a question about the new controller. This proposal introduces a new controller for the new CRD. However, this controller is also responsible for updating the VMI and VM spec. Would this design be reasonable, or does anyone see any issue with it? AFAIU, today, every object has its own controller that updates its fields. Triggering the VMI and VM controllers seems more complex to me, but if you see any issues I can also explore this option.

vasiliy-ul commented 10 months ago

I have a question about the new controller. This proposal introduces a new controller for the new CRD. However, this controller is also responsible for updating the VMI and VM spec. Would this design be reasonable, or does anyone see any issue with it? AFAIU, today, every object has its own controller that updates its fields. Triggering the VMI and VM controllers seems more complex to me, but if you see any issues I can also explore this option.

I do not know the historical reasons behind the existing design, but keep in mind that if you modify VMI and VM objects from several controllers, you will need to ensure some sort of synchronization. Otherwise, you may introduce hard to debug race conditions.

So this seems reasonable to me:

today, every object has its own controller that updates its fields.

alicefr commented 10 months ago

I have a question about the new controller. This proposal introduces a new controller for the new CRD. However, this controller is also responsible for updating the VMI and VM spec. Would this design be reasonable, or does anyone see any issue with it? AFAIU, today, every object has its own controller that updates its fields. Triggering the VMI and VM controllers seems more complex to me, but if you see any issues I can also explore this option.

I do not know the historical reasons behind the existing design, but keep in mind that if you modify VMI and VM objects from several controllers, you will need to ensure some sort of synchronization. Otherwise, you may introduce hard to debug race conditions.

So this seems reasonable to me:

today, every object has its own controller that updates its fields.

After thinking a bit more about it, I think it is possible.

Currently, I have all the logic in the controller for volume migration but I can move it to the VMI/VM controllers.

For the VMI, once the migration state is successful and there are migrated volumes in the status, the VMI controller can update the VMI spec. For the VM, the VM controller can react when the VMI has updated the volumes in its spec. Then the VM controller can reflect the change in the VM too.
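
To illustrate the flow, a minimal sketch of the kind of VMI status the VMI controller could react to; the migratedVolumes layout and field names are assumptions for illustration, not the final API:

status:
  migrationState:
    completed: true  # the migration finished successfully
  migratedVolumes:   # hypothetical list recording which volumes were switched
  - volumeName: datavolumedisk1
    sourcePVCInfo:
      claimName: src-pvc1
    destinationPVCInfo:
      claimName: dest-pvc1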

alicefr commented 10 months ago

Updated the design proposal with the latest feedback on the API, corrected the typos, and added examples for the pending and rejected volume cases.

alicefr commented 10 months ago

@awels many thanks for the review! I have updated the proposal accordingly.

@vasiliy-ul when you have some time, would you mind having another look at the proposal? I'd like to consolidate the API and update the code accordingly.

alicefr commented 10 months ago

I updated the proposal with the latest feedback, set Delete as the default reclaim policy, and updated the examples accordingly.

alicefr commented 9 months ago

@awels @vasiliy-ul @mhenriks I have updated the design proposal with the last feedback and discussion. Would you mind taking another look when you have time?

In particular, I added more details about the DV handling in the design section and the limitations regarding gitops or similar declarative mechanisms.

alicefr commented 9 months ago

@fabiand @xpivarc @acardace @vladikr I think this design is reaching a quite stable phase. Could you please also review it and share feedback?

mhenriks commented 9 months ago

/lgtm

kubevirt-bot commented 9 months ago

New changes are detected. LGTM label has been removed.

alicefr commented 9 months ago

Thanks @alicefr this looks pretty neat!

I wonder if we really need a new CRD to manage this, what do you think about doing this transparently when you edit the VM spec now that we have the rollout strategy mechanism merged? If you were to edit the spec with an incompatible volume you'd just set the RestartRequired condition and wait for a restart, otherwise you'd migrate the VM swapping the old volume with the new one.

What do you think?

Many thanks for the review.

I don't know if what you are suggesting is possible.

First, it is only applicable to VMs (this might be still acceptable).

Second, what if the volume migration fails? You have modified the VM spec, but you should roll back to the previous values. Additionally, the migration can take very long. What should we do in this case, apply the update of the spec only if the migration is successful after a long period, like 2 hours?

Third, the information in the status of the new CRD should be added to the VM status. This might be quite a lot of information to add there.

The last issue I can currently see with this approach is that you cannot specify additional fields like the sourceReclaimPolicy. Right now we don't have many, so we might decide on a single behavior, but I think we could need additional options in the future. Do you see any workarounds to these points?

UPDATE: On the other hand, the big advantage of this approach is that it would be compatible with gitops. @awels @mhenriks what do you think?

alicefr commented 9 months ago

One additional thing I need to change is to replace the Phase with Conditions. Apparently, Phase is deprecated in Kubernetes (see the documentation about the API conventions).
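
For illustration, a sketch of what the status could look like with conditions instead of a phase; the condition type and reason below are made up:

status:
  conditions:
  - type: MigrationInProgress  # hypothetical condition type
    status: "True"
    reason: VolumeMigrationRunning
    message: "Migrating src-pvc1 to dest-pvc1"
    lastTransitionTime: "2024-01-12T08:54:11Z"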

alicefr commented 9 months ago

@acardace What if we keep the new CRD but, in a second step, add the option that an update of the VM spec can generate a volume migration object? The storage migration is complex enough to deserve its own API and CRD. WDYT?

acardace commented 9 months ago

Thanks @alicefr this looks pretty neat! I wonder if we really need a new CRD to manage this, what do you think about doing this transparently when you edit the VM spec now that we have the rollout strategy mechanism merged? If you were to edit the spec with an incompatible volume you'd just set the RestartRequired condition and wait for a restart, otherwise you'd migrate the VM swapping the old volume with the new one. What do you think?

Many thanks for the review.

I don't know if what you are suggesting is possible.

First, it is only applicable to VMs (this might be still acceptable).

That's how rollout strategy works, so I don't see an issue there.

Second, what if the volume migration fails? You have modified the VM spec, but you should roll back to the previous values. Additionally, the migration can take very long. What should we do in this case, apply the update of the spec only if the migration is successful after a long period, like 2 hours?

We just keep retrying to migrate, we still aim for eventual consistency, and we should not roll anything back.

The change would be triggered by a patch in the VM spec, so this is a good question! For CPU/Memory hotplug we publish a condition that the hotplug is in progress and later remove it when the migration has been completed.

Third, the information in the status of the new CRD should be added to the VM status. This might be quite a lot of information to add there.

I'd not consider this a blocker; CPU/Memory hotplug both add the info they need to the status.

The last issue I can currently see with this approach is that you cannot specify additional fields like the sourceReclaimPolicy. Right now we don't have many, so we might decide on a single behavior, but I think we could need additional options in the future. Do you see any workarounds to these points?

You could decide to have this setting only at the cluster level; otherwise, you'd have to add it to the current Volume spec to have it per-VM, but the direction lately is to keep things simple with sane defaults.

UPDATE: On the other hand, the big advantage of this approach is that it would be compatible with gitops. @awels @mhenriks what do you think?

yes, indeed.

mhenriks commented 9 months ago

@alicefr @acardace A couple notes on controlling migration via VM.

First, a cluster admin looking to migrate from storage class A to storage class B would likely prefer an API that is VM agnostic, since they really only know/care about PVCs. Mapping PVCs to VMs is an extra step. But I don't think it is the intention that most users will invoke this API directly. Rather, there will be some other operator/controller/CRD managing a bulk migration. So I think a VM based migration API is fine.

Now the question is: how can we have a storage migration API that is gitops friendly? I don't think this is as straightforward as with mem/cpu hotplug. Migration is a two-step process: 1) migrate to the new volume, 2) remove the old volume; and in order for migration to be compatible with gitops, both of those steps have to be driven by the user. I think we'd have to do something like this:

VirtualMachine at t0

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: vm-root
[...]

t1 - Specify migration volume

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk
           - disk:
               bus: virtio
             name: migrationDisk
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: vm-root
        - name: migrationDisk
          volumeMigration:
            sourceClaim: vm-root
            targetClaim: vm-root-mig
[...]

t3 - migration completes

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
 status:
    volumeMigrationStatuses:
    - name: migrationDisk
      completed: true
      # other stuff

t4 - remove old volume

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: vm-root-mig
[...]

alicefr commented 9 months ago

Now the question is: how can we have a storage migration API that is gitops friendly? I don't think this is as straightforward as with mem/cpu hotplug. Migration is a two-step process: 1) migrate to the new volume, 2) remove the old volume; and in order for migration to be compatible with gitops, both of those steps have to be driven by the user. I think we'd have to do something like this:

I do think what @acardace is suggesting is gitops friendly. For example, the original VM specification is:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: src-pvc
[...]

and then the user changes to:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: dst-pvc
[...]

This update should trigger the volume migration under the hood.

In the case of datavolumes, the user should replace them in the volumes section with PVCs AND remove the data volume template.
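
A sketch of that DataVolume case (names are illustrative, and dst-pvc is assumed to already exist): the user would go from

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  dataVolumeTemplates:
  - metadata:
      name: src-dv
    spec:
      [...]
  template:
    spec:
      volumes:
      - name: rootDisk
        dataVolume:
          name: src-dv

to:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
    spec:
      volumes:
      - name: rootDisk
        persistentVolumeClaim:
          claimName: dst-pvc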

Now, the missing piece here is handling the source volumes after the migration. In the current proposal, this is defined by the SourceReclaimPolicy, and it would be missing from the VM update. An option would be to further simplify this and delegate the source volume removal to the overlaying tool. @awels how does this sound to you?

I personally prefer to avoid putting any knowledge of volume migration into the VM specification.

The other question I have is whether we should preserve the VolumeMigration CRD or completely remove it. This would lead to 2 paths for migrating the storage. Initially, we could avoid the introduction of the CRD and eventually add it in a second round?

How does this sound to you?

awels commented 9 months ago

Now, the missing piece here is handling the source volumes after the migration. In the current proposal, this is defined by the SourceReclaimPolicy, and it would be missing from the VM update. An option would be to further simplify this and delegate the source volume removal to the overlaying tool. @awels how does this sound to you?

My tool already has to know this information because it would have to set the field in the CRD. So it is not that much more work to simply do the work instead of delegating it.

I personally prefer to avoid putting any knowledge of volume migration into the VM specification.

Agreed

mhenriks commented 9 months ago

This update should trigger the volume migration under the hood.

@alicefr I don't see how that scheme would work if the VM has multiple disks. Also what if someone wants to replace (not migrate) a disk with a different already existing/populated disk?

alicefr commented 9 months ago

This update should trigger the volume migration under the hood.

@alicefr I don't see how that scheme would work if the VM has multiple disks. Also what if someone wants to replace (not migrate) a disk with a different already existing/populated disk?

For multiple disks, I think, the user would need to patch the spec with multiple volumes at the same time.

As for replacing, I think we need to define what "replace" means. First, the volume migration would be triggered only if the VM is running. Replacing an existing disk on a running VM would imply unplugging the old disk and hotplugging the new one. I agree this will be a hard and subtle difference to define. I think replacing a disk should be done in 2 different updates, while a volume migration should be in the same request.

/cc @acardace @xpivarc this is what we discussed this morning about hotplug

mhenriks commented 9 months ago

This update should trigger the volume migration under the hood.

@alicefr I don't see how that scheme would work if the VM has multiple disks. Also what if someone wants to replace (not migrate) a disk with a different already existing/populated disk?

For multiple disks, I think, the user would need to patch the spec with multiple volumes at the same time.

I guess we can just assume the same order as before, but I typically prefer explicit to implicit.

As for replacing, I think we need to define what "replace" means. First, the volume migration would be triggered only if the VM is running. Replacing an existing disk on a running VM would imply unplugging the old disk and hotplugging the new one. I agree this will be a hard and subtle difference to define. I think replacing a disk should be done in 2 different updates, while a volume migration should be in the same request.

My point is mostly about the VM stopped case. I don't think a config change should do two different things based on whether the VM is running or not. Someone could have their own scheme for backing up/restoring VM disks and just want to switch to an older version of the disk from a backup.

/cc @acardace @xpivarc this is what we discussed this morning about hotplug

alicefr commented 9 months ago

hi,

There have been multiple discussions in parallel, and the current situation is pretty chaotic, so I'd like to provide an update on the current challenges and options.

The primary concern with the current architecture is the order in which the VM specification is updated with the destination volumes. In the current design, this occurs as the final step, when the migration is successful. This appears to be an anti-pattern: it is not gitops friendly, and it also automatically modifies the VM specification without the user's explicit action.

An alternative to this behavior was suggested: trigger the volume migration when the user explicitly updates the volumes in the VM specification. Unfortunately, this method may be confusing because it isn't easy to tell whether the user wished to replace an old disk with a new one (for example, by unplugging and then hot-plugging, or by restarting the VM) or to make a complete copy of the disk with volume migration.

Ideally, the desired output of this discussion should be a clean approach to trigger volume migration when the volumes in the VM specification are updated, like changing the claimName from:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: src-pvc
[...]

to:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
spec:
  template:
     spec:
       domain:
         devices:
           disks:
           - disk:
               bus: virtio
             name: rootDisk 
       volumes:
        - name: rootDisk
          persistentVolumeClaim:
            claimName: dst-pvc
[...]

I don't see a simple solution to this problem; however, several possibilities were suggested:

  1. Triggering the volume migration when the VM spec update of the volumes occurs AND the volume migration CR exists. To migrate storage, both conditions must be met; otherwise, the update action would be considered a disk replacement.
  2. We do not touch the VM specification at all. The volume migration will only modify the VMI specification and will adhere to the flow proposed in the present architecture. In this case, the user must initiate the volume migration and monitor its progress before updating the volumes in the VM specification.

Neither approach is optimal, and each has its own disadvantages.

Between the 2 current solutions, I would prefer the second option. Does anyone have better ideas? /cc @vladikr @acardace @mhenriks @awels @xpivarc

awels commented 9 months ago

Yeah, thinking about it a bit more, option 1 is too racy and one would have to define a specific order, which is probably not what we want. Not sure I like option 2 either. What happens if we create our CR, the VMI migrates and the storage there is changed, but then the user never updates the VM? We now have an inconsistent state. If the user shuts down the VM and starts it again, it would be back to the original state. We could at least track in the CR whether the VM has been updated by the user or not, to give some indication that there is something left to do.

mhenriks commented 9 months ago

Yeah, thinking about it a bit more, option 1 is too racy and one would have to define a specific order, which is probably not what we want. Not sure I like option 2 either. What happens if we create our CR, the VMI migrates and the storage there is changed, but then the user never updates the VM? We now have an inconsistent state. If the user shuts down the VM and starts it again, it would be back to the original state. We could at least track in the CR whether the VM has been updated by the user or not, to give some indication that there is something left to do.

Yeah, with option 2 the VM is in a very precarious state for the time between when the migration completes and when the spec is updated. To address this, the VM controller would have to be made aware of the volume migration CR, and then we are back to option 1.

alicefr commented 9 months ago

@awels @mhenriks, what I'm hearing here is that neither option 1 nor 2 is viable, am I correct?

mhenriks commented 9 months ago

@awels @mhenriks, what I'm hearing here is that neither option 1 nor 2 is viable, am I correct?

Not sure I'm ready to give up on option 2 yet. Maybe we can leverage the VM status to make it viable? Perhaps by adding a storageMigrationInProgress field in the VM status that refers to the name of the migration CR. When the migration starts, the value is set and remains set until the user updates the VM spec to match the migration. So the storage migration controller is watching the VM for spec changes, and there is a finalizer on the migration until the VM spec is updated.
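
For illustration, a minimal sketch of what that could look like in the VM status; the field name is just the suggestion above, not an existing API:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
[...]
status:
  storageMigrationInProgress: vol-mig  # hypothetical field holding the name of the in-flight migration CR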

There is already a similar field called restoreInProgress for VM restores.

vladikr commented 9 months ago

From my point of view, modifying the VMI spec is an anti-pattern as well. I would prefer to avoid it.

I'm still puzzled about what is too racy with the first option. If I understood correctly, disk replacements are only done via restarts. Hence, if the VM is running and the VM spec is updated, we can assume that this is a disk replacement and do nothing until the VM "restarts". During that time a volume migration object may appear, and then we act. If the object is already there at the time of the VM change, we act immediately.

alicefr commented 9 months ago

(sketch "race-solution1": the two possible orderings of the VM spec update and the volume migration object creation)

@vladikr I think the problem arises, for example, when the VM is running and the volume migration object creation is slower than the update. I tried to sketch the 2 possible cases.

In case 1, when the server processes the VM update, it can see the volume migration object and can interpret the update as a storage migration.

However, the case that concerns me is the second one: at the moment of the VM spec update, the volume migration object isn't visible to the controller. In this case, the update is interpreted as an unplug of the old volume and a hotplug of the new one, and the second PVC stays empty even though we should have copied the data.

I think there are other cases like this one.