tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0

Add option for send to make a new full backup #139

Open keeperofdakeys opened 1 year ago

keeperofdakeys commented 1 year ago

Would it be possible to add an option to perform a new full backup of a volume? Or does this already exist in the form of "--remap"?

One of the dangers of delta backups is that after a year or two, it's possible a bug in wyng or lvm could corrupt the delta block tracking. For example, what happens if the system crashes in the middle of a wyng backup? So I'd like to perform an occasional full backup to ensure the delta chain starts fresh from time to time.

tasket commented 1 year ago

I've thought about this in the past, but decided against it at the time because the focus was on deduplication and verifying the integrity of existing data. I also wrote down a lot of reasons why not, which I won't get into here and which seem to be shared by most (though not all) backup projects.

It should be noted that Wyng offers several ways to verify data, and the delta info you're concerned about is also the verification metadata (i.e. chunk hashes). If the delta info gets borked then anything in Wyng that reads volume data (receive, verify, diff, arch-check) will report an error. And the parts of Wyng that verify integrity are very well tested and scrutinized, as they're integral to the core send/receive functions.

You may be particularly interested in diff and arch-check: The former compares the latest session against your local volume snapshot directly, and the latter combs through the delta lists in the archive in addition to verifying data for every session. All of the above enable spot checks to whatever extent you see fit.

Wyng's format is robust but also relatively simple and you can validate archive data outside of Wyng. See the _wyngextract.sh script in 'misc' as an example of verifying data in bash.
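As an illustration, a periodic spot-check could be scripted around those commands. The volume name and exact option syntax below are placeholders (they differ between Wyng versions), so treat this as a sketch rather than copy-paste commands:

```
#!/bin/sh
# Hypothetical spot-check routine using the commands mentioned above;
# adjust volume names and options to your Wyng version and archive location.

# Compare the latest session of a volume against the local snapshot
wyng diff my-volume

# Verify the stored data for that volume
wyng verify my-volume

# Comb through the whole archive: delta lists plus data for every session
wyng arch-check
```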

send is one of the safer write operations in which to have a crash, as it's only tacking on an extra dir that means nothing to the archive until the instant the volume's metadata file is replaced (basically doing an mv to replace an older file).


The future possibilities I see are:

  1. No change; use raid or similar storage layer for redundancy.

  2. Full-data backup sessions, as you suggest. To be really germane to your concern about integrity, however, the full-data session would have to look more like a walled-off area of the archive which can't be changed until it's explicitly deleted.

  3. Support something like whole-archive 'snapshots' which would preserve an "epoch" of metadata+data unchanged within the archive while still allowing incremental backups to be added in newer epochs.

  4. Extend the hashing options so Wyng can, for example, show an entire volume's hash for any given session. This could be done for both data and entire metadata state (not just delta files) so that reading the volume data isn't the only proof that delta metadata wasn't mangled.

  5. Adding PAR or similar data redundancy.

For now, I see 1 & 4 as the near-term options, and no. 4 provides additional ways Wyng can check its own work over time. The other options have limited value when it's the correct function of the program that is in question; of course, the simplest answer to that is to add dd | zstd >image.zst to your backup procedure.
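As a rough illustration of that side-channel approach (the device path and output file below are made up, and the volume should be quiesced or snapshotted first):

```
# Occasional full-image backup alongside Wyng, independent of its delta chain
dd if=/dev/vgmain/my-volume bs=1M status=progress | zstd -T0 -o my-volume.img.zst

# Restore by reversing the pipeline
zstd -dc my-volume.img.zst | dd of=/dev/vgmain/my-volume bs=1M
```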

keeperofdakeys commented 1 year ago

Thanks for the detailed response, your point of view makes a lot of sense.

The idea of occasional fulls comes from a more pragmatic point of view: any single bug in the program could lead to a loss of the backup chain. For example, if you have two years' worth of deltas, a corrupt delta from one year ago could mean that the last restorable backup is from one year ago. On the other hand, a monthly full would ensure that in the worst case you at least have a backup from one month ago. Not to mention that long backup chains often increase the processing time of restores.

From a quick look at the code, I see that a new backup session is made as a tmp directory before being moved into the correct place, and snapshots / deltamap are only rotated / cleaned after the backup succeeds. So an interrupted backup would essentially get ignored (are the tmp directories ever cleaned up?).
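In other words, the shape of it as I read it is the usual write-to-temp-then-rename pattern, where the rename is the commit point. The paths below are made up and this is not Wyng's actual code, just the general idea:

```
# Sketch of the general pattern only -- not Wyng's real layout or code.
mkdir /archive/my-volume/S_new.tmp
# ... write all new session data and metadata into the tmp dir ...
sync
# Commit: only now does the archive reference the new session
mv /archive/my-volume/S_new.tmp /archive/my-volume/S_new
```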

For now I'll leave this open in case someone wants to look more into this. I'll be fine with pruning to a month worth of backups, and an arch-check once a month.

tasket commented 1 year ago

Yes, the sequence for write operations like send is carefully arranged. There is also a fair amount of prevention. For example, near the end of _sendvolume() the _check_manifestsequence() function is called to check that the new metadata (the deltas) meshes correctly with the old metadata and that the result is well-formed.

OTOH, the riskier write operation is prune. There are simply more moving parts in this process, so Wyng has more code to recognize and recover from an interrupted pruning operation. In this case, the archive's header (archive.ini) is first updated with a tag indicating that a transaction is in progress, with references to the specific data, before any changes are made to the data. When resuming after an interruption, the transaction tag enables Wyng to pick up where it left off.
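The general idea of such a transaction tag (not Wyng's actual archive.ini format, just a generic sketch of the intent-log pattern described above) looks something like this:

```
# Generic intent-log sketch -- file name and keys are illustrative only.
# 1. Record what is about to happen, before touching any data
echo "in-progress = prune <session-id>" >> archive.ini

# 2. Perform the destructive steps (delete/merge chunks, rewrite manifests)...

# 3. Clear the tag once the archive is consistent again
sed -i '/^in-progress/d' archive.ini

# On the next run, a leftover "in-progress" line signals an interrupted
# operation, so the program can resume or repair before doing anything else.
```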

What it boils down to is that every incremental backup program is like some type of database. There is a certain amount of complexity, and write transactions carry some risk; I don't disagree with your concern there. But from a logical and practical standpoint, the best way to hedge against bugs in a program leading to data corruption is to occasionally use a much simpler full backup system on the side, even something like dd.

tlaurion commented 1 year ago

Unfortunately, as of today, I cannot risk letting go of the 0.3.2 20220818 release, for reasons essentially the opposite of this ticket's.

Wyng has been extremely stable for me with --dedup incremental backups since the 0.3 release on Qubes 4.0.

None of the deduped archives I restored ever failed on me; I've had only a single full, initial backup since my first use, and have applied pruning here and there without any corruption.

I have been planning to move to 0.4 for a while now; the only blocker is cached passphrase support for restoring multiple archives.

I'm commenting here on the stability of the current codebase, taking for granted that not much has changed in that regard in 0.4 compared to the 0.3 release.

Again, my concerns are about the stability of zstd vs the bzip currently used; if anything changed there, it would cause breakage.

Otherwise, I'm looking forward to moving away from 0.3, but I still rely on 0.3 backups which I restore really often for my use case (dev environments from way back containing compiled Heads stuff).

Pruning did not cause any issues. Again, I do send operations with --dedup --remap --clean each time to make sure all past .tick/.tock volumes that no longer exist are cleaned up.
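For reference, the send invocation described there would look roughly like this (the volume name is a placeholder and option placement may differ between versions, so check wyng --help for your release):

```
# Hypothetical example of the routine described above
wyng --dedup --remap --clean send my-volume
```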

I will document my current setup in my remote archive directory and move to 0.4 for my current LVM volumes soon. Honestly, though, I'm still waiting for passphrase caching before switching to 0.4.

And if that is pushed forward, I will start using wyng to store backups in remote cloud storage.

@tasket has anything moved forward on Amazon/cloud providers, so backups can be pushed to cheap cloud storage? That would move this project forward with more testers.

I would have loved that when going abroad to FOSDEM: storing things in the cloud instead of putting disks into cargo and having to restore upon arrival. Having private data encrypted in the cloud, restoring it onto Qubes from the cloud, and then pruning would have been amazing.

Again @tasket, keep up the amazing work. This project is one I depend on, and so many others are just waiting to discover it. Native cloud support and passphrase caching are the two missing features for this to be widely tested and moved upstream, in my opinion.

@keeperofdakeys note that instead of redoing a full backup, you should probably test the integrity of your archives as part of your backup/recovery planning.

tlaurion commented 1 year ago

Just saw the changes for passphrase caching in 0.4 for automated receive operations.

I will copy data safely to other external backup storage and give 0.4 a chance on ext4 storage without a LUKS container, trusting dom0 to pass the passphrase directly through a cron job in the future. I'll also open other issues for the Qubes wrapper to automatically add new templates and newly created qubes, which I currently do from other manually crafted scripts.

I'm looking to dodge manual LUKS unlocking and mounting for a while. (Even though I could add the volume to crypttab and have Heads unlock it automatically, which would resolve the issue, I never felt comfortable having the backup volume unlocked. I'm still not really comfortable with crontab holding my backup passphrase either, but I'm unsure how to dodge the problem: my archive contains volumes I do not want accessible from dom0, which I delete when in transit.)

@tasket if you have any ideas on that, please feel free to open corresponding issues. Otherwise I will open them as I go.