@tlaurion What happens when you retry?
This error happens when Btrfs online maintenance (such as balancing) causes the subvol's btree transaction number to change while Wyng is still acquiring metadata.
Completing, stopping or pausing the balance (or other online maintenance) should make the subvol quiet enough for Wyng to finish acquisition. Also, the fewer volumes you back up in one invocation, the faster Wyng's acquisition runs, and in theory there is less chance of the subvol experiencing coincidental transactions.
There are a number of things Wyng might do someday to make Btrfs acquisition more robust (retry instead of fail, retry one vol at a time, or fall back to full scan mode), but I've only seen this error once... on my own system, when I intentionally tried to trigger it.
Also, this is not to imply Wyng doesn't already take precautions: Requiring the local path to reference a subvol is one, since that narrows the scope of objects affected by a maint op at any given time. Another is that the reflink vol snapshots are created as read-only and then a read-only snapshot of the subvol is created and used for the acquisition.
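For example, a running balance can be checked and paused before invoking Wyng, then resumed afterward. The commands below assume the Btrfs pool is mounted at /; adjust the mountpoint to your layout:
# Check whether a balance (or scrub) is currently running on the pool:
sudo btrfs balance status /
sudo btrfs scrub status /
# Pause it for the duration of the backup, then resume once Wyng is done:
sudo btrfs balance pause /
sudo btrfs balance resume /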
@tasket creating the btrfs snapshot is taking forever (maybe because of multiple clones; no bees for the moment), and then it seems to fail right after?
As said, I have restored btrfs backups over new QubesOS installations before, and this is kind of new: the error pops up and persists between attempts, whereas before, as you said, it was transient and disappeared on the second run.
How can I share logs with you securely? Here, as armored text files encrypted against your public key in this repo? Notes: I'm now testing Btrfs extensively without bees, to finally see the over-amplification you were talking about in other threads.
As can be seen here with the gray area on each CPU, the host is at 100% IOWAIT, whereas on VM shutdown (linked to rotating images for qvm-volume restore) it is sometimes at 1200%. Nothing more here beyond a default QubesOS deployment under Btrfs; I have also reverted having a separate Btrfs filesystem, to test Btrfs by itself first and deploy bees on top of it once I tune the fs properly. Will open another issue to tackle upstreaming better Btrfs mkfs options and tuning.
See the btrfs subvolume creation stalling here, with IOWAIT and the gray area on all cores just before the error is thrown, persistently:
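If it helps to put numbers on the stall shown in the screenshots, iowait can be watched live from dom0 while the snapshot is being created. vmstat is normally present in dom0; iostat needs the sysstat package:
# Per-second CPU summary; the "wa" column is the iowait percentage:
vmstat 1
# Per-device utilization, if sysstat is installed:
iostat -xz 1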
@tasket: this might help. Ran with extra verbose and debug flags:
sudo wyng-util-qubes --dest qubes-ssh://$wyng_qube:root@$ssh_server$archive_path --authmin 1080 --includes -w remap -w opt-ssh=-C --dedup backup -wtag='sys-print,working' -wtag='qusal,working state' --debug --verbose -wdebug -wverbose
Acquiring deltas.*generation b'ID 256 gen 8192 top level 5 path root\nID 257 gen 13 top level 256 path root/var/lib/portables\nID 258 gen 8192 top level 256 path root/var/lib/qubes\nID 277 gen 8192 top level 256 path root/var/lib/qubeswyng_snapshot_tmp\n'
Created subvol snapshot at /var/lib/qubeswyng_snapshot_tmp/
*generation b'ID 256 gen 8193 top level 5 path root\nID 257 gen 13 top level 256 path root/var/lib/portables\nID 258 gen 8193 top level 256 path root/var/lib/qubes\nID 277 gen 8192 top level 256 path root/var/lib/qubeswyng_snapshot_tmp\n'
fs gen ids differ.
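In the output above, the subvol generations moved from 8192 to 8193 between Wyng's first and second poll, which is exactly the condition the error reports. The same comparison can be reproduced by hand; this sketch assumes /var/lib/qubes is the subvol given to Wyng as the local path:
# Read the subvol's generation, wait a bit, then read it again; if the number
# changes while nothing should be writing, something is still committing
# transactions on that subvol:
sudo btrfs subvolume show /var/lib/qubes | grep Generation
sleep 5
sudo btrfs subvolume show /var/lib/qubes | grep Generation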
Also note
[user@dom0 ~]
$ qvm-ls | wc -l
53
$ qvm-ls | grep TemplateVM | wc -l
13
[user@dom0 ~]
$ qvm-ls | grep AppVM | wc -l
29
So: many clones of templates on the reflink pool, without any btrfs tuning applied, so defaults plus the equivalent of your now-deployed misc script from a recent commit to have varlibqubes as a btrfs subvol; that's it.
@tlaurion Wyng logs are not going to provide much guidance here: what Wyng is doing is rather simple, polling a subvol property at the start and then again at the end of the procedure.
I also don't think tuning has much to do with it. My daily laptop has a tuned Btrfs, but I have two other systems that are just using defaults.
The biggest variable may be system speed: my daily laptop is pretty fast and it accounts for about 90% of the testing. If you're using Ivy Bridge or a similar vintage then maybe a race condition is creeping in due to an async fs process not finishing quickly. You could try this sync-and-pause patch on Wyng beta rel. 20240610 to see if it helps:
1462a1463
> fssync(dest, force=True, sync=True) ; time.sleep(3)
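For anyone else wanting to try it: the snippet above is a normal-format diff (append the quoted line after line 1462), so it can be applied with patch. The script location below is an assumption; point it at wherever your copy of wyng lives:
# Save the two diff lines above to a file, e.g. sync-pause.diff, then:
sudo patch /usr/local/bin/wyng sync-pause.diff
# ...or simply insert the fssync(...) line by hand after line 1462 of the script.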
Will try the patch later on, AFK. But this is an NV41, Intel 12th gen, 12 cores.
@tlaurion Nevermind, I think it's a path handling bug. Testing now...
@tlaurion @kennethrrosen Addressing the "fs gen ids differ" error:
btrfs subvolume list <path>
was returning a list with multiple paths, and a / was misplaced in the path.
A fix has been posted to the '08wip' branch.
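If you run Wyng from a git checkout, picking up the fix is just a branch switch; adjust if you installed the script some other way:
# Fetch the branch carrying the fix and confirm the latest commits:
git fetch origin
git checkout 08wip
git pull
git log --oneline -3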
Worked on second run! Thanks!
Same. Cheers @tasket !
Is there a workaround for this, @tasket?