tlaurion closed this issue 5 months ago
@tlaurion Issue #201 was posted to address the way Wyng loads metadata so it can be more fault-tolerant (although I will say... that remote fs is certainly not helping with whatever it considers to be fault tolerance). Wyng is careful to record data first and metadata last (as .tmp files); this is important because it doesn't even try to finalize a change (by renaming the .tmp files) until everything has transferred, so the archive stays blind to any changes until the last instant, when four mv ops are executed. BTW, 0-byte files sound like the problems I used to have 20 years ago running my systems on XFS (which back then was tuned by default for data center use and, I believe, relied on write caching).
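To make the ordering concrete, the pattern described above looks roughly like this (a generic sketch of write-tmp-then-rename, not Wyng's actual code):

```python
import os

def finalize_metadata(path: str, data: bytes) -> None:
    # Stage the new metadata under a .tmp name; nothing references it yet,
    # so a crash here leaves the archive's visible state untouched.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # push the bytes to stable storage
    # Only now is the change finalized: the rename is atomic on POSIX, so
    # readers see either the old metadata or the complete new metadata.
    os.replace(tmp, path)
```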
If recovering that specific archive interests you, I could add some features to the debug branch in the next day or two that should remove the offending session for you. It would help a lot though to know the error you're getting after you clear /var/lib/wyng and try to access the archive again.
Incidentally, Wyng has a --maxsync option which calls sync more often, including on the remote fs. This might help avoid the problem (but so could turning off any write-caching / delayed-allocation features that are intended for expensive hardware).
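For context, "calling sync" boils down to something like the following at the OS level (a generic sketch, not Wyng's implementation); note that, in general, an atomic rename only becomes durable once the containing directory entry is also flushed:

```python
import os

def sync_dir(dirpath: str) -> None:
    # A rename is atomic, but only durable once the directory entry that
    # records it is flushed; fsync on a directory fd does that on Linux.
    fd = os.open(dirpath, os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

# os.sync() asks the kernel to flush all dirty buffers system-wide.
# Calling it more often (the effect of --maxsync) narrows the window in
# which a crash, or a remote fs dropping out, leaves 0-byte files behind.
os.sync()
```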
Will remove fs optimizations.
@tlaurion I did just write a mod that works in debug mode. It should fix an archive if (and only if) the last session 'info' or manifest file is corrupt (won't work if the 'volinfo' or 'archive.ini' are affected). Let me know if you want to try it and I'll push it to debug.
I no longer have the broken session; I'm re-backing up everything, which is still taking a long time since it's 1.7 TB.
I think it was linked to a misconfigured /etc/fstab write-cache setting, not a bug on your side!
@tlaurion The 08wip branch now has the ability to remove corrupt sessions, simply by using the arch-check command in attended mode. The integrity-testing part has been tested, but I haven't yet tested removal of a corrupt session.
If a corrupt session is found, it invalidates any other session that comes after it. So the repair process is quite conventional in that it 'rewinds' a corrupted volume to the last known good state.
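Conceptually, the repair amounts to something like this (a hypothetical sketch; the names here are illustrative, not Wyng internals):

```python
from typing import Callable

def rewind_to_last_good(sessions: list[str],
                        is_corrupt: Callable[[str], bool]) -> list[str]:
    # Sessions are ordered oldest to newest, and each session's deltas
    # build on the previous one, so the first corrupt session also
    # invalidates everything recorded after it.
    for i, session in enumerate(sessions):
        if is_corrupt(session):
            return sessions[:i]   # keep only the last known good state
    return sessions               # nothing corrupt; keep everything
```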
Implemented and tested.
This is the second time I've hit issues that I guess should not happen under normal circumstances; I was able to deal with them somehow when archives were unencrypted, but I can't anymore.
When sending a new session (unfortunately, my PoC archive server backend is short on resources), sometimes the mdadm backend and its associated md0 (raid5) array go offline along with the mounted partition the archives are supposed to be sent to. When that happens, weirdly, the session's info files are smaller than they should be; in fact they are empty files.
Of course, playing with the content on the encrypted archive side is pretty much impossible. In these circumstances, removing dom0 /var/lib/wyng/* doesn't help, and deleting the new session directory reported in debug output (the mapping of the volume to its encrypted vol dir is printed, along with the location of the new session dir) doesn't help either.
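For reference, on an unencrypted archive the empty metadata files could at least be located with a quick scan along these lines (a sketch only; the mount path below is a placeholder):

```python
from pathlib import Path

# Placeholder path: wherever the (unencrypted) archive is mounted.
archive = Path("/mnt/backups/wyng.backup")
for f in archive.rglob("*"):
    if f.is_file() and f.stat().st_size == 0:
        print("possibly truncated:", f)
```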
I wish arch-check could work and help under these circumstances, reporting the problem and letting the end user interactively delete the corrupted session, or that delete would at least work for the whole volume, but wyng fails too early.
Up to now, I have had to redo a whole arch-init and wipe the old archives. I'm not sure this behavior is desired, since it is entirely possible for a backup session to get interrupted midway, whether from a network failure or a power outage. In my case, it seems to be a corner case not yet isolated, where something uses more RAM than available and the OOM killer strikes randomly, which has nothing to do with wyng. But the fact that wyng cannot recover from this situation and requires redoing the whole archive seems like a problem that calls for some kind of recovery or self-healing capability.
Ideally, wyng would not fail here; it would mark the last session as bad and maybe delete it, so that the end user can just send a new backup session to the archive as if nothing happened.
Sorry, I do not have traces this time; having moved my system from lvm to btrfs twice now, I got impatient. But if it happens again and more details are needed, I will report them in this issue.