tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0

Cannot backup a restored system: "Acquiring deltas.fs gen ids differ" #209

Closed: tlaurion closed this issue 1 month ago

tlaurion commented 5 months ago
sudo wyng-util-qubes --dest qubes-ssh://$wyng_qube:root@$ssh_server$archive_path --authmin 1080 --includes -w remap -w opt-ssh=-C --dedup backup -wtag='sys-print,working' -wtag='qusal,working state' --debug
wyng-util-qubes v0.9 beta rel 20240531
['/usr/sbin/wyng', '--dest=qubes-ssh://wyng-rpi5_raid5:root@rpi5/mnt/Backups/nv41', '--authmin=1080', '--remap', '--opt-ssh=-C', '--tag=sys-print,working', '--tag=qusal,working state', '--json', 'list']

Selected for backup:
[...]

 Skipping dom0
[...]
  Queuing full scan of import 'wyng-qubes-metadata'
Acquiring deltas.fs gen ids differ.

Is there a workaround for this, @tasket?

tasket commented 5 months ago

@tlaurion What happens when you retry?

This error happens when Btrfs online maintenance (such as balancing) causes the subvol's btree transaction number to change while Wyng is still acquiring metadata.

Completing, stopping or pausing the balance (or other online maintenance) should make the subvol quiet enough for Wyng to finish acquisition. Also, the fewer volumes you back up in one invocation, the faster Wyng's acquisition runs, and theoretically there is less chance of the subvol experiencing coincidental transactions.

There are a number of things Wyng might do someday to make Btrfs acquisition more robust (retry instead of fail, retry one vol at a time, or fall back to full-scan mode), but I've only seen this error once... on my own system, when I intentionally tried to trigger it.
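
Conceptually, the check amounts to comparing the subvol's generation before and after acquisition. A minimal sketch of the idea, not Wyng's actual code; the btrfs subvolume show invocation and the path here are assumptions:

import subprocess

def subvol_generation(path):
    # Parse the 'Generation:' field from `btrfs subvolume show <path>`.
    out = subprocess.run(["btrfs", "subvolume", "show", path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("Generation:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no Generation field for " + path)

gen_before = subvol_generation("/var/lib/qubes")   # hypothetical subvol path
# ... metadata acquisition from the read-only snapshot happens here ...
gen_after = subvol_generation("/var/lib/qubes")
if gen_before != gen_after:
    # a balance or other online maintenance wrote to the subvol mid-acquisition
    raise RuntimeError("fs gen ids differ.")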

tasket commented 5 months ago

Also, this is not to imply Wyng doesn't already take precautions: Requiring the local path to reference a subvol is one, since that narrows the scope of objects affected by a maint op at any given time. Another is that the reflink vol snapshots are created as read-only and then a read-only snapshot of the subvol is created and used for the acquisition.
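
For reference, a read-only subvol snapshot of the kind described can be made with btrfs subvolume snapshot -r; a minimal sketch with hypothetical paths, not Wyng's actual ones:

import subprocess

src = "/var/lib/qubes"                  # hypothetical source subvol
snap = "/var/lib/qubes_wyng_snap_tmp"   # hypothetical snapshot destination
# -r creates the snapshot read-only, so acquisition reads from a frozen view
subprocess.run(["btrfs", "subvolume", "snapshot", "-r", src, snap], check=True)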

tlaurion commented 5 months ago

@tasket creating the btrfs snapshot takes forever (maybe because of the multiple clones; no bees for the moment), and then it seems to fail right after?

As said, I have restored btrfs backups over a new QubesOS installation, and it is kind of new that this error pops up and persists between attempts; before, as you said, it was transient and disappeared on the second run.

How can I share logs with you securely? Here, as armored text files encrypted against your public key in this repo? Note: I am now extensively testing BTRFS without bees, to finally see the over-amplification you were talking about in other threads.

As can be seen from the gray area on each CPU, the host is at 100% IOWAIT here; when shutting down a VM (linked to rotating images for qvm-volume restore), it is sometimes at 1200%. Nothing more here than the default QubesOS deployment under BTRFS; I have also reverted the separate BTRFS filesystem in order to test BTRFS alone first, and will deploy bees on top of it once I tune the fs properly. I will open another issue to tackle upstreaming better BTRFS mkfs options and tuning.

See the btrfs subvolume creation stalling here, and the iowait (gray area on all cores) just before the error is thrown, persistently: (screenshot 2024-06-20-130928)

tlaurion commented 5 months ago

@tasket: this might help:

Ran with extra verbosity and debug: sudo wyng-util-qubes --dest qubes-ssh://$wyng_qube:root@$ssh_server$archive_path --authmin 1080 --includes -w remap -w opt-ssh=-C --dedup backup -wtag='sys-print,working' -wtag='qusal,working state' --debug --verbose -wdebug -wverbose

Acquiring deltas.*generation b'ID 256 gen 8192 top level 5 path root\nID 257 gen 13 top level 256 path root/var/lib/portables\nID 258 gen 8192 top level 256 path root/var/lib/qubes\nID 277 gen 8192 top level 256 path root/var/lib/qubeswyng_snapshot_tmp\n'
Created subvol snapshot at /var/lib/qubeswyng_snapshot_tmp/
*generation b'ID 256 gen 8193 top level 5 path root\nID 257 gen 13 top level 256 path root/var/lib/portables\nID 258 gen 8193 top level 256 path root/var/lib/qubes\nID 277 gen 8192 top level 256 path root/var/lib/qubeswyng_snapshot_tmp\n'
fs gen ids differ.

tlaurion commented 5 months ago

Also note

[user@dom0 ~]
$ qvm-ls | wc -l
53              
$ qvm-ls | grep TemplateVM | wc -l
13              
[user@dom0 ~]
$ qvm-ls | grep AppVM | wc -l
29              

So: many clones of templates on the reflink pool, without having applied any tuning for btrfs (so, defaults), plus the equivalent of your misc script now deployed under a recent commit to have varlibqubes as a btrfs subvol; that's it.

tasket commented 4 months ago

@tlaurion Wyng logs are not going to provide much guidance here: what Wyng is doing is rather simple, polling a subvol property at the start and then again at the end of the procedure.

I also don't think tuning has much to do with it. My daily laptop has a tuned Btrfs, but I have two other systems that are just using defaults.

The biggest variable may be system speed: my daily laptop is pretty fast and it accounts for about 90% of the testing. If you're using Ivy Bridge or a similar vintage, then maybe a race condition is creeping in due to an async fs process not finishing quickly. You could try this sync-and-pause patch on Wyng beta rel. 20240610 to see if it helps:

1462a1463
>             fssync(dest, force=True, sync=True)     ; time.sleep(3)
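
(The "1462a1463" is ed-style diff notation: append the ">" line after line 1462 of the 20240610 beta, where it becomes line 1463. The added line forces a sync and a 3-second pause, presumably to let async fs activity settle before the generation is re-read.)
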
tlaurion commented 4 months ago

Will try the patch later on, AFK. But this is an NV41: Intel 12th gen, 12 cores.

tasket commented 4 months ago

@tlaurion Never mind, I think it's a path-handling bug. Testing now...

tasket commented 4 months ago

@tlaurion @kennethrrosen Addressing the "fs gen ids differ" error:

btrfs subvolume list <path> was returning a list with multiple paths, and a / was misplaced in the path.

A fix has been posted to the '08wip' branch.
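
For context, here is a sketch of the kind of exact-path filtering such a fix implies. It is illustrative only, not the actual patch; the helper name and arguments are hypothetical.

import subprocess

def generation_of(mountpoint, subvol_path):
    # btrfs subvolume list reports every subvolume below the mount, e.g.
    #   "ID 258 gen 8192 top level 256 path root/var/lib/qubes"
    # so the output must be matched against the exact target path; a prefix
    # test would also catch root/var/lib/qubeswyng_snapshot_tmp.
    out = subprocess.run(["btrfs", "subvolume", "list", mountpoint],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        gen, path = int(fields[3]), fields[-1]   # assumes no spaces in path
        if path == subvol_path:                  # exact match, not startswith()
            return gen
    raise LookupError(subvol_path + " not found under " + mountpoint)

# e.g. compare generation_of("/", "root/var/lib/qubes") before and after acquisition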

tlaurion commented 4 months ago

Worked on second run! Thanks!

kennethrrosen commented 4 months ago

> Worked on second run! Thanks!

Same. Cheers @tasket!