oVirt / vdsm

The Virtual Desktop Server Manager
GNU General Public License v2.0

storage: Use offset on (cold) merge instead of measuring required size. #418

Open dupondje opened 2 months ago

dupondje commented 2 months ago

When we commit a snapshot into a base volume, we first do a measure and then extend the base volume by the required size. But the assumption that the measured size is sufficient is not valid.

The required size reported by measure is the size that would be required for a NEW volume, not the size required to commit the top volume into the base volume.

While the final required size of a new volume might be lower than the size of the base volume, it might still be necessary to extend the base volume to fit the committed data.

See for example: https://gitlab.com/qemu-project/qemu/-/issues/2369

Base vol: 13421772800 bytes / end offset 13421772800
Top vol: 6174015488 bytes / end offset 3488350208

Measure gives us a required size of 3491168256 bytes.

But if we commit the top volume, the base volume grows to 13504806912 bytes.

Therefore we check the end offset of each volume, and the sum of those is the required size. We might allocate too much here, but finalize shrinks the LV again to the end offset of the committed base image, so only some additional space is needed during the commit.
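
A minimal sketch of that calculation, assuming `qemu-img check` is used to read each volume's reported image-end-offset (the paths and the helper name are illustrative, not the actual vdsm API):

```python
import json
import subprocess

def image_end_offset(path):
    # Highest used host offset of the qcow2 image, as reported
    # by "qemu-img check --output=json".
    out = subprocess.run(
        ["qemu-img", "check", "--output=json", "-f", "qcow2", path],
        capture_output=True, text=True).stdout
    return json.loads(out)["image-end-offset"]

# Proposed required size for the cold merge: the sum of both end
# offsets. This may over-allocate, but finalize shrinks the LV back
# to the end offset of the committed base image.
required = image_end_offset("base.qcow2") + image_end_offset("top.qcow2")
```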

dupondje commented 2 months ago

Still need to work on this further, but it's clear that the current calculation is not always right. I also reported an issue on qemu to provide a better measure. See https://gitlab.com/qemu-project/qemu/-/issues/2369

For cold merge I think we shouldn't care too much about over-extending, as we reduce the size anyway. But for live merge, I think we should implement extending during merge to handle this properly.

@nirs : as this was your area, what are your thoughts on this?

nirs commented 2 months ago

> Still need to work on this further, but it's clear that the current calculation is not always right. I also reported an issue on qemu to provide a better measure. See https://gitlab.com/qemu-project/qemu/-/issues/2369

It seems like you found a case where the calculation is incorrect. If I understand the qemu issue correctly, this happens when the snapshot is discarded before the commit?

The issue seems to be that discarded clusters are not considered since they hide the same clusters in the base image. If you copy the image, these clusters would not exist in the copy. But when we commit the top image into the base, the commit should also discard the same clusters in the base image. So this may be an issue with commit, not only with measure.

The first step in handling this issue is to add a vdsm issue for it, and add a failing test reproducing the bad measure or the wrong commit.

> For cold merge I think we shouldn't care too much about over-extending, as we reduce the size anyway.

For cold merge we do care about over-extending, since this can fail the merge when the storage domain does not have enough available space.

> But for live merge, I think we should implement extending during merge to handle this properly.

Extending during merge will be very hard to do; the existing code is complicated enough without it.

I would try to find a better way than using the image end offset - the offset can be larger than needed since qcow2 never decreases the offset after clusters are discarded. Can we measure the top and base separately instead of using the image end offset?

So until we have a fix in qemu-img measure or qemu-img commit (or both), we can do something like:

required = measure(base) + measure(top)

Need testing to see if this gives better results than:

required = offset(base) + offset(top)
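
A rough sketch of how the two estimates could be compared side by side (the paths are placeholders; only `qemu-img measure` and `qemu-img check` output fields that actually exist are used):

```python
import json
import subprocess

def _qemu_img_json(args):
    out = subprocess.run(["qemu-img", *args],
                         capture_output=True, text=True).stdout
    return json.loads(out)

def measure(path):
    # Size needed to convert this image to a NEW qcow2 volume.
    return _qemu_img_json(
        ["measure", "--output=json", "-f", "qcow2", "-O", "qcow2", path])["required"]

def offset(path):
    # Highest used host offset in the existing image file.
    return _qemu_img_json(
        ["check", "--output=json", "-f", "qcow2", path])["image-end-offset"]

base, top = "base.qcow2", "top.qcow2"
print("measure-based:", measure(base) + measure(top))
print("offset-based: ", offset(base) + offset(top))
```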

Also worth discussing this with Kevin or Hanna from qemu, maybe starting a thread about this on the qemu-discuss or qemu-block mailing lists?

dupondje commented 2 months ago

> > Still need to work on this further, but it's clear that the current calculation is not always right. I also reported an issue on qemu to provide a better measure. See https://gitlab.com/qemu-project/qemu/-/issues/2369

> It seems like you found a case where the calculation is incorrect. If I understand the qemu issue correctly, this happens when the snapshot is discarded before the commit?

This happens when you have ZERO but allocated clusters in the base image. Measure thinks: zero -> will take no space -> don't count space for those blocks. But with discard-no-unref, the ZERO but allocated clusters still have a host offset, so committing the snapshot into the base will not reuse those clusters for new data, because they are already allocated. The fact that they are ZERO does not mean they can be used for new data (only for new data at the same offset).

> The issue seems to be that discarded clusters are not considered since they hide the same clusters in the base image. If you copy the image, these clusters would not exist in the copy. But when we commit the top image into the base, the commit should also discard the same clusters in the base image. So this may be an issue with commit, not only with measure.

> The first step in handling this issue is to add a vdsm issue for it, and add a failing test reproducing the bad measure or the wrong commit.

Ack

> > For cold merge I think we shouldn't care too much about over-extending, as we reduce the size anyway.

> For cold merge we do care about over-extending, since this can fail the merge when the storage domain does not have enough available space.

Clear

> > But for live merge, I think we should implement extending during merge to handle this properly.

> Extending during merge will be very hard to do; the existing code is complicated enough without it.

> I would try to find a better way than using the image end offset - the offset can be larger than needed since qcow2 never decreases the offset after clusters are discarded. Can we measure the top and base separately instead of using the image end offset?

This won't work, as measure does not count zero but allocated clusters (aka clusters with a host offset). The only way to do it without a qemu patch is doing qemu-img map and calculating it ourselves ... seems crazy :)
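
For illustration, counting the space an image actually occupies from `qemu-img map` could look roughly like this (a sketch only; whether zero-but-allocated clusters show up with a host offset in map output is exactly what would need verifying, and qcow2 metadata is ignored):

```python
import json
import subprocess

def allocated_guest_bytes(path):
    # Sum guest ranges in this image (depth 0) that are backed by a
    # host offset. Note: this counts only guest-visible clusters;
    # qcow2 metadata (L1/L2 tables, refcounts) is not included.
    out = subprocess.run(
        ["qemu-img", "map", "--output=json", "-f", "qcow2", path],
        capture_output=True, text=True).stdout
    return sum(e["length"] for e in json.loads(out)
               if e.get("depth") == 0 and "offset" in e)
```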

> So until we have a fix in qemu-img measure or qemu-img commit (or both), we can do something like:

> required = measure(base) + measure(top)

> Need testing to see if this gives better results than:

> required = offset(base) + offset(top)

> Also worth discussing this with Kevin or Hanna from qemu, maybe starting a thread about this on the qemu-discuss or qemu-block mailing lists?

I expected some response on my patch, but nothing yet :)

nirs commented 2 months ago

> > > Still need to work on this further, but it's clear that the current calculation is not always right. I also reported an issue on qemu to provide a better measure. See https://gitlab.com/qemu-project/qemu/-/issues/2369

> > It seems like you found a case where the calculation is incorrect. If I understand the qemu issue correctly, this happens when the snapshot is discarded before the commit?

> This happens when you have ZERO but allocated clusters in the base image. Measure thinks: zero -> will take no space -> don't count space for those blocks. But with discard-no-unref, the ZERO but allocated clusters still have a host offset, so committing the snapshot into the base will not reuse those clusters for new data, because they are already allocated. The fact that they are ZERO does not mean they can be used for new data (only for new data at the same offset).

Can you create a minimal test case showing the issue? I think an image with one cluster in top and base should be good enough to demonstrate it. This can probably be done easily as a failing test in tests/storage/qemuimg_test.py, but for discussion with the qemu folks we need a simple shell script demonstrating the issue.
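
The failing-test half of that could look something like the sketch below (assumptions to verify: the `discard-no-unref` qcow2 option spelling, the qemu-io/qemu-img invocations, and the assertion expressing the expected behavior; pytest's `tmp_path` is used for scratch files):

```python
import json
import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def test_measure_covers_commit(tmp_path):
    base = str(tmp_path / "base.qcow2")
    top = str(tmp_path / "top.qcow2")

    run("qemu-img", "create", "-f", "qcow2", base, "10M")
    # Write one cluster, then discard it with discard-no-unref,
    # leaving a ZERO but still allocated cluster in the base image.
    opts = f"driver=qcow2,file.driver=file,file.filename={base},discard-no-unref=on"
    run("qemu-io", "--image-opts", opts,
        "-c", "write 0 64k", "-c", "discard 0 64k")

    # Top image with one data cluster at a different guest offset.
    run("qemu-img", "create", "-f", "qcow2", "-b", base, "-F", "qcow2", top)
    run("qemu-io", "-f", "qcow2", top, "-c", "write 1M 64k")

    measured = json.loads(run(
        "qemu-img", "measure", "--output=json", "-f", "qcow2", "-O", "qcow2", top))

    run("qemu-img", "commit", "-f", "qcow2", top)
    check = json.loads(run("qemu-img", "check", "--output=json", "-f", "qcow2", base))

    # Expected to fail today: the committed base grows beyond what
    # measure reported as required.
    assert check["image-end-offset"] <= measured["required"]
```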

Then we need to discuss the expected behavior of measure and commit with the qemu folks.

> > I would try to find a better way than using the image end offset - the offset can be larger than needed since qcow2 never decreases the offset after clusters are discarded. Can we measure the top and base separately instead of using the image end offset?

> This won't work, as measure does not count zero but allocated clusters (aka clusters with a host offset). The only way to do it without a qemu patch is doing qemu-img map and calculating it ourselves ... seems crazy :)

Before qemu-img measure was published, vdsm implemented measure internally: https://github.com/oVirt/vdsm/blob/d5594b27a5443e7d55ca12e743c1e3c8800caec9/lib/vdsm/storage/qcow2.py

So this is not that crazy; if qemu-img gives us the info, we can create a special measure command for the commit use case before we have a fix from qemu.

Example code showing how we can measure the commit correctly using qemu-img would also help the qemu folks fix qemu-img measure, or help us prepare a patch that makes qemu-img measure work for the commit use case.
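
As a starting point, such a commit-aware measure might be sketched like this (a hypothetical helper, not vdsm or qemu API; it assumes `qemu-img map` exposes a host offset for zero-but-allocated clusters in the base, which is part of what needs confirming, and it ignores metadata growth):

```python
import json
import subprocess

def _map_json(path):
    out = subprocess.run(
        ["qemu-img", "map", "--output=json", "-f", "qcow2", path],
        capture_output=True, text=True).stdout
    return json.loads(out)

def measure_commit(top, base):
    # Start from the current end offset of the base image...
    end = json.loads(subprocess.run(
        ["qemu-img", "check", "--output=json", "-f", "qcow2", base],
        capture_output=True, text=True).stdout)["image-end-offset"]

    # ...and add every data cluster in top whose guest range has no
    # host cluster in base to rewrite in place (ZERO but allocated
    # clusters can only be reused by data at the same guest offset).
    base_alloc = [(e["start"], e["start"] + e["length"])
                  for e in _map_json(base) if "offset" in e]
    for e in _map_json(top):
        if e.get("depth") == 0 and e.get("data"):
            covered = any(s <= e["start"] and e["start"] + e["length"] <= t
                          for s, t in base_alloc)
            if not covered:
                end += e["length"]
    return end
```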