oVirt / vdsm

The Virtual Desktop Server Manager
GNU General Public License v2.0

livemerge: prune bitmaps before measuring #357

Closed aesteve-rh closed 1 year ago

aesteve-rh commented 1 year ago

Prune stale bitmaps before a live merge to avoid failing with ENOSPC.

Fixes: #352
Signed-off-by: Albert Esteve aesteve@redhat.com

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

I am currently trying to verify the fix with a stress test, but I am posting the PR already so we can refine it early and integrate as soon as I manage to reproduce the issue (if I do).

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

I tried to reproduce the issue, both manually and with ovirt-stress, but I couldn't get the "No space left on device" error to appear in the libvirtd logs on a full setup with two hosts and a running VM. Unfortunately, I cannot spend more time on it.

So I would finish the review of this PR and integrate it. Hopefully QE has better luck reproducing it.

nirs commented 1 year ago

I tried to reproduce the issue, both manually and with ovirt-stress, but I couldn't get the "No space left on device" error to appear in the libvirtd logs on a full setup with two hosts and a running VM.

How did you try to reproduce?

When we start a live merge, we measure the top volume to get the required size of base. This measures the data and the bitmaps in base. https://github.com/oVirt/vdsm/blob/6f67a73dac7c02009c0125a59e002590f9cc2beb/lib/vdsm/virt/livemerge.py#L405

If the base size is not smaller than the measured size, we don't extend base. https://github.com/oVirt/vdsm/blob/6f67a73dac7c02009c0125a59e002590f9cc2beb/lib/vdsm/virt/livemerge.py#L457
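
Schematically, the decision looks like this. This is a minimal sketch with hypothetical names (measure_required, extend_base_if_needed), not vdsm's actual code; the real logic lives in livemerge.py linked above:

    import json
    import subprocess

    def measure_required(top_path):
        # Measure top for a qcow2 target; the result includes the space
        # the committed data and bitmaps will need in base.
        out = subprocess.run(
            ["qemu-img", "measure", "-f", "qcow2", "-O", "qcow2",
             "--output", "json", top_path],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)["required"]

    def extend_base_if_needed(base_size, top_path, extend):
        # Extend base only when the measurement exceeds its current
        # size. If stale bitmaps already waste space inside base, a
        # measurement that fits exactly skips the extend and the
        # commit can fail with ENOSPC.
        required = measure_required(top_path)
        if required > base_size:
            extend(required)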

So we need to find how much data we must have in top so that the measured size is exactly the base size, then add enough stale bitmaps to base (we needed 6 bitmaps in the tests to reproduce this).

You can do the measurements with the reproducer script, using 2.5g images (instead of 128m). When you find the right size, do the following (a measurement sketch follows the steps below):

  1. Add a second disk to the running VM using qcow2 format
    • This disk will start as an empty 2.5g disk
  2. Create a snapshot (snap1)
    • This will create an empty 2.5g snapshot
  3. Fill this snapshot with the right amount of data so the merge should succeed
    • Can be done inside the VM by writing random data directly to the second disk
      dd if=/dev/urandom bs=1M count=xxx of=/dev/vdb oflag=direct conv=fsync

      Writing zeros would be faster, but when committing the data qemu detects zeros and converts them to zero clusters, which would prevent reproduction of this issue.

  4. The snapshot will be extended to 5g during the write - this is fine and does not affect the test
  5. Create another snapshot (snap2)
    • Now we can merge the middle layer into the base
  6. Add enough stale bitmaps to the base volume using qemu-img bitmap --add
  7. In the engine UI, delete snapshot snap1
    • This merges the contents of the middle layer (snap2) into the base layer (snap1)
    • It should fail because of the stale bitmaps
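
To tune the data size, something like this can compare the measurement of the top volume against the base size. This is only an illustration, not the reproducer script; the paths and the 2.5g size are placeholders:

    import json
    import subprocess

    def measure(path):
        # qemu-img measure on an existing qcow2 image also reports the
        # space its bitmaps need ("bitmaps" in the JSON output).
        out = subprocess.run(
            ["qemu-img", "measure", "-f", "qcow2", "-O", "qcow2",
             "--output", "json", path],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)

    base_size = 2560 * 1024**2     # 2.5g base volume
    top = measure("/path/to/top")  # placeholder path
    print("required:", top["required"], "/ base size:", base_size)
    print("base bitmaps:", measure("/path/to/base").get("bitmaps", 0))

The goal is to land top's required size exactly on base_size, so vdsm skips the extend, while the stale bitmaps in base still need space of their own.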

I think we need to reproduce this first before we modify the code. If you don't have time, maybe someone from QE has time to reproduce.

aesteve-rh commented 1 year ago

That is exactly how I tried.

First I tried with a 3GiB disk, but it was not fully provisioned: the image vdsm created was 2.5GiB while the virtual size was 3GiB, so I changed it to 2.5GiB (2560MiB). That didn't trigger the error either, and I even tried with 128MiB, which is what we used for cold merges. No luck.

I tried writing directly in the VM (to be fair, I had only tried with /dev/zero, so today I retried with /dev/urandom), and with qemu-io on the host directly to the top volume:

qemu-io -f qcow2 {top_path} -c "write 0 2000M"
qemu-io -f qcow2 {top_path} -c "write 2000M 558M"

I also tried with more snapshots, removing the first or the second snapshot, automatically and manually (these tests actually helped me understand the process better).

None of these really worked, at least for my goal: in no case did I get the qemu error in the journalctl output.

I will document the steps and hand this over to QE; I really cannot put much more time into trying. I wanted to move on completely today, but I still dedicated the morning to trying the /dev/urandom proposal, with no luck either.

My feeling is that the patch should work, the same way we fixed cold merges: it basically runs the same operation as cold merges do, and the test proves that it runs when required. If OST runs pass (showing no regression), I'm good with it.
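
For reference, this is roughly what prune-before-measure amounts to, expressed with qemu-img. A hedged sketch under the assumption that pruning can be done from the command line; the PR's actual implementation goes through vdsm's storage bitmaps code, and the helper names here are hypothetical:

    import json
    import subprocess

    def list_bitmaps(path):
        # qemu-img info exposes qcow2 bitmaps under format-specific data.
        info = json.loads(subprocess.run(
            ["qemu-img", "info", "--output", "json", path],
            check=True, capture_output=True, text=True,
        ).stdout)
        data = info.get("format-specific", {}).get("data", {})
        return [b["name"] for b in data.get("bitmaps", [])]

    def prune_stale_bitmaps(path, keep=()):
        # Drop every bitmap not explicitly kept, so stale bitmaps no
        # longer inflate the measurement or consume space in base.
        for name in list_bitmaps(path):
            if name not in keep:
                subprocess.run(
                    ["qemu-img", "bitmap", "--remove", path, name],
                    check=True,
                )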

nirs commented 1 year ago

That is exactly how I tried.

First I tried with a 3GiB disk, but it was not fully provisioned: the image vdsm created was 2.5GiB while the virtual size was 3GiB, so I changed it to 2.5GiB (2560MiB). That didn't trigger the error either, and I even tried with 128MiB, which is what we used for cold merges. No luck.

I tried writing directly in the VM (to be fair, I had only tried with /dev/zero,

This cannot work: qemu detects the zeros during the merge and does not write any data, so the merge cannot fail.

...

None of these really worked, at least for my goal: in no case did I get the qemu error in the journalctl output.

Getting the error is not a reliable way to detect the issue. The issue is detected by checking the merge logs. When the merge completes after an error, the volume is not removed from the chain, and engine will report that the snapshot is illegal.

When this happens, trying to delete the snapshot again is expected to fail forever.

aesteve-rh commented 1 year ago

I have finally been able to consistently reproduce the issue with 2560MiB disks.

Reproduction steps

  1. Create snap1: A (base) < B (active)
  2. Write 2555MiB from the running VM:
    dd if=/dev/urandom bs=1M count=2555 of=/dev/vda oflag=direct conv=fsync
  3. Create snap2: A (base) < B (top) < C (active)
  4. Add 70 bitmaps to the base image from the host:
    for i in $(seq 70); do
        qemu-img bitmap --add /dev/<vg_name>/<lv_name> stale-bitmap-$i
    done

    Resulting measures:

    Top image (snap2):

    required size: 2679767040
    fully allocated size: 2685009920
    bitmaps size: 0

    Base image (snap1):

    required size: 655360
    fully allocated size: 2685009920
    bitmaps size: 9240576
  5. Remove snap1 from the engine UI to trigger the internal merge.

Error case (without the fix in this PR)

Vdsm logs:

2022-12-05 18:21:39,245+0100 ERROR (libvirt/events) [virt.vm] (vmId='bb1ff201-a159-4436-a23f-8dad4cced1e9') Block job 89667ef9-0b0f-4b63-8ed9-9a720e0d4845 type COMMIT for drive vdb has failed (vm:5955)

Resulting base measure:

required size: 2675048448
fully allocated size: 2685009920
bitmaps size: 9240576

Engine UI: [screenshot]

If we skip adding the bitmaps the merge succeeds, so it is an error caused by the stale bitmaps.

Success case (with the fix in this PR)

Vdsm logs:

After START merge and before measuring the relevant disk, we see the prune_bitmaps logs:

2022-12-07 09:49:47,864+0100 WARN  (jsonrpc/2) [storage.bitmaps] Prune stale bitmaps ['stale-bitmap-1', 'stale-bitmap-2', 'stale-bitmap-3', 'stale-bitmap-4', 'stale-bitmap-5', 'stale-bitmap-6', 'stale-bitmap-7', 'stale-bitmap-8', 'stale-bitmap-9', 'stale-bitmap-10', 'stale-bitmap-11', 'stale-bitmap-12', 'stale-bitmap-13', 'stale-bitmap-14', 'stale-bitmap-15', 'stale-bitmap-16', 'stale-bitmap-17', 'stale-bitmap-18', 'stale-bitmap-19', 'stale-bitmap-20', 'stale-bitmap-21', 'stale-bitmap-22', 'stale-bitmap-23', 'stale-bitmap-24', 'stale-bitmap-25', 'stale-bitmap-26', 'stale-bitmap-27', 'stale-bitmap-28', 'stale-bitmap-29', 'stale-bitmap-30', 'stale-bitmap-31', 'stale-bitmap-32', 'stale-bitmap-33', 'stale-bitmap-34', 'stale-bitmap-35', 'stale-bitmap-36', 'stale-bitmap-37', 'stale-bitmap-38', 'stale-bitmap-39', 'stale-bitmap-40', 'stale-bitmap-41', 'stale-bitmap-42', 'stale-bitmap-43', 'stale-bitmap-44', 'stale-bitmap-45', 'stale-bitmap-46', 'stale-bitmap-47', 'stale-bitmap-48', 'stale-bitmap-49', 'stale-bitmap-50', 'stale-bitmap-51', 'stale-bitmap-52', 'stale-bitmap-53', 'stale-bitmap-54', 'stale-bitmap-55', 'stale-bitmap-56', 'stale-bitmap-57', 'stale-bitmap-58', 'stale-bitmap-59', 'stale-bitmap-60', 'stale-bitmap-61', 'stale-bitmap-62', 'stale-bitmap-63', 'stale-bitmap-64', 'stale-bitmap-65', 'stale-bitmap-66', 'stale-bitmap-67', 'stale-bitmap-68', 'stale-bitmap-69', 'stale-bitmap-70'] from '/rhev/data-center/mnt/blockSD/a2ad7a23-96c5-4c23-9007-c0d1892ea07d/images/d4e763f5-67d8-4a73-9c3b-79da330032e5/0033c0b7-69f3-43fa-b36c-8c0a518aea84' (bitmaps:144)
2022-12-07 09:49:47,878+0100 INFO  (jsonrpc/2) [storage.bitmaps] Remove bitmap stale-bitmap-1 from '/rhev/data-center/mnt/blockSD/a2ad7a23-96c5-4c23-9007-c0d1892ea07d/images/d4e763f5-67d8-4a73-9c3b-79da330032e5/0033c0b7-69f3-43fa-b36c-8c0a518aea84' (bitmaps:117)
...

Vdsm measures the expected data:

FINISH measure return={'result': {'bitmaps': 0, 'required': 2679767040, 'fully-allocated': 2685009920}} from=::ffff:192.168.122.96,38934, flow_id=13305ae3-0084-4eed-9f5a-ec31a43424ff, task_id=b77b670e-79ae-4cad-8d64-71b6259b83b8 (api:37)

And the merge finishes correctly, with no ERRORs in the logs.

Other disks also get a prune_bitmaps call, but without the bitmap-removal WARN logs:

INFO  (jsonrpc/3) [vdsm.api] START prune_bitmaps(sdUUID='a2ad7a23-96c5-4c23-9007-c0d1892ea07d', imgUUID='c7b5c2f6-87ca-4ba6-a498-e12025f0ab12', volUUID='285874a4-0fc6-4bcf-9afc-696f40dd12bc', baseUUID='7f2f1e6e-d32d-4883-941e-836b77c4dde3') from=::ffff:192.168.122.96,38934, flow_id=13305ae3-0084-4eed-9f5a-ec31a43424ff, task_id=55136c39-aacb-4c02-9d8f-17d93a763b5f (api:31)
...
INFO  (jsonrpc/3) [vdsm.api] FINISH prune_bitmaps return=None from=::ffff:192.168.122.96,38934, flow_id=13305ae3-0084-4eed-9f5a-ec31a43424ff, task_id=55136c39-aacb-4c02-9d8f-17d93a763b5f (api:37)

Resulting base measure (bitmaps removed):

required size: 2679767040
fully allocated size: 2685009920
bitmaps size: 0

Engine logs:

INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [13305ae3-0084-4eed-9f5a-ec31a43424ff] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'snap 1' deletion for VM 'internal-livemerge-vm' has been completed.

The snapshot also disappeared from the engine UI, and the task finished successfully.

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost

michalskrivanek commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost

aesteve-rh commented 1 year ago

/ost