kubernetes-csi / lib-volume-populator

Shared library for use by volume populators.

How to populate multiple PVs in one pass? #40

Closed · rwmjones closed 1 year ago

rwmjones commented 2 years ago

We want to write a volume populator for virt-v2v, a tool which imports virtual machines from VMware. The disks of these virtual machines would be mapped to block-based PVs.

The hello world example is pretty good, and shows that we could fairly easily write a populator for single-disk VMs. However, there doesn't seem to be a way to populate multiple PVs at the same time, so multi-disk VMs couldn't be imported. Multi-disk VMs are not especially common, but they do exist, and it would be a shame to limit this to single-disk VMs.

Is there a way to import to multiple PVs using a single volume populator that we may have missed?
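To make the single-disk case concrete, here is roughly the shape we have in mind, following the hello-world pattern. This is only a sketch: the `VMwareDiskSource` group and kind are made-up placeholders, not anything provided by lib-volume-populator.

```yaml
# Hypothetical CRD instance describing where one disk comes from.
# The group/kind names are placeholders, not part of lib-volume-populator.
apiVersion: v2v.example.com/v1alpha1
kind: VMwareDiskSource
metadata:
  name: myvm-disk0
spec:
  vcenter: vcenter.example.com
  vm: myvm
  disk: 0
---
# The PVC asks the populator to fill a raw block volume from that source.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myvm-disk0
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 50Gi
  dataSourceRef:
    apiGroup: v2v.example.com
    kind: VMwareDiskSource
    name: myvm-disk0
```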

bswartz commented 2 years ago

The model for populators is that they extend the data source mechanism from just PVCs and VolumeSnapshots to be wide open. The underlying data source mechanism is only for a single PVC, though. I think what you need is a high-level mechanism that leverages populators underneath.

If you need to create 3 new disks, you could have a high-level object+controller pair that spits out 3 PVCs with 3 different data sources, and let 3 populators do the legwork of filling in the data for those volumes. If there's only a single object that represents the source, you might need another layer of indirection for the data sources that allows a user to tell the system which of the 3 volumes this one is at population time.
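A rough sketch of that indirection, with entirely hypothetical group/kind names, might look like this: one high-level object describing the import, a per-volume data source that records which disk it stands for, and a PVC per disk pointing at its data source.

```yaml
# Hypothetical high-level object owned by a custom controller
# (all group/kind names here are invented for illustration).
apiVersion: v2v.example.com/v1alpha1
kind: VMImport
metadata:
  name: myvm
spec:
  vcenter: vcenter.example.com
  vm: myvm
  diskCount: 3
---
# One of the 3 per-volume data sources the controller would stamp out;
# diskIndex tells the populator which of the VM's disks this PVC is.
apiVersion: v2v.example.com/v1alpha1
kind: VMImportDisk
metadata:
  name: myvm-disk1
spec:
  importRef: myvm
  diskIndex: 1
---
# The PVC for that disk points at its own data source.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myvm-disk1
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 50Gi
  dataSourceRef:
    apiGroup: v2v.example.com
    kind: VMImportDisk
    name: myvm-disk1
```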

rwmjones commented 2 years ago

It's not really possible to run the virt-v2v step multiple times, because the changes made to the disks would be inconsistent, for example where the disks are part of a RAID array or an LV is split over two disks. So I don't think the second approach is possible.

Can you explain a bit more about the "high level mechanism"? What sort of thing should I be looking at?

nyoxi commented 2 years ago

I believe this aligns with what I have already proposed. Forklift would provide the orchestration (the high-level mechanism) here. It will create PVCs with data sources -- one per disk -- and wait for the data to be transferred by the volume populator. We could use the same populator to also store the VMX file (or maybe use some other mechanism to store VM metadata). When the PVCs are filled with data, Forklift makes sure that virt-v2v is started with the disks and metadata to do the conversion.

The only drawback of the mechanism I can see so far is with filesystems that span multiple disks. The download would be less efficient because we would have to skip the sparsification/trimming step, as we don't see the whole filesystem.

rwmjones commented 2 years ago

virt-v2v really does not want to do in-place conversions, and it's not at all efficient to do so. What would be better is some way to run virt-v2v that both connects to VMware and populates all the PVCs at the same time.

At the moment I've got a volume populator which runs virt-v2v but only works if the guest has a single disk.

(Or we could change it to populate the disks into files in a single filesystem PVC, but I don't think KubeVirt has any way to boot from that, so it'd involve a second copy, which is not nice.)
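For reference, the single-disk setup is straightforward on the Kubernetes side: the populator pod just gets the PVC's raw block device and writes the converted image into it. The sketch below is only a generic illustration of that shape; the image name and device path are placeholders, and in practice lib-volume-populator creates and manages the populator pod itself rather than you applying one by hand.

```yaml
# Generic sketch of a populator-style pod consuming one block-mode PVC.
# Image name and device path are placeholders; lib-volume-populator
# normally creates and manages the populator pod itself.
apiVersion: v1
kind: Pod
metadata:
  name: populate-myvm-disk0
spec:
  restartPolicy: Never
  containers:
    - name: populate
      image: example.com/virt-v2v-populator:latest  # placeholder image
      volumeDevices:
        - name: target
          devicePath: /dev/block  # raw device the conversion writes into
  volumes:
    - name: target
      persistentVolumeClaim:
        claimName: myvm-disk0
```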

bswartz commented 1 year ago

The approach I'd recommend is to split the problem into two steps:

1. First, define a single object that represents multiple volume data sources, plus another object that represents a single request to clone the group of them, and build a controller that does this work up front, before the volume populator gets involved.
2. Second, have the controller generate individual data sources for each volume that was already cloned, along with PVCs referencing those data sources, and make the actual populator simply bind each PVC to its already-created volume.

Kubernetes never needs to know that there was a relationship between the volumes, and neither does the volume populator, as long as you have a higher-level controller do the actual work beforehand. I think of this as a "meta populator". The key is to have a way to represent the cloned volumes in the intermediate period between when the meta populator creates them and when Kubernetes learns about them in the form of PVs.
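Sketching that intermediate representation with invented kind names: the meta populator pre-creates the volumes as a consistent group and leaves behind one record per volume, and the per-PVC data source only has to point at that record so the populator can bind the claim to the existing volume. Everything below is hypothetical, just to show the shape.

```yaml
# Hypothetical record left behind by the "meta populator" after it has
# already created the backend volume as part of a consistent group clone.
apiVersion: v2v.example.com/v1alpha1
kind: PrePopulatedVolume
metadata:
  name: myvm-disk1
spec:
  groupRef: myvm            # the group clone this volume came from
  volumeHandle: vol-abc123  # backend volume that already holds the data
---
# The PVC's data source points at that record, so the per-volume
# populator only has to bind the claim to the existing volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myvm-disk1
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 50Gi
  dataSourceRef:
    apiGroup: v2v.example.com
    kind: PrePopulatedVolume
    name: myvm-disk1
```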

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-csi/lib-volume-populator/issues/40#issuecomment-1564735476):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.