tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0

arch-dedup & dedup being made for LVMs sharing same areas automatically? #64

Open tlaurion opened 4 years ago

tlaurion commented 4 years ago

Hey @tasket

I was wondering what is actually done out of the box for deduplication when sending, since applying arch-dedup frees up a lot of space. Can you elaborate a bit on that and point to the relevant code?

Thanks!

tasket commented 4 years ago

Hi @tlaurion

The default is "dedup level 1" which means wyng send checks to see if the updated chunk hash is the same for that particular address as it was during the last backup (that's why send now merges manifests into a 'fullmanifest' of the volume before sending, unless dedup level has been set to 0).
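To illustrate the idea (this is only a sketch, not wyng's actual code; `needs_send` and `fullmanifest` are hypothetical names), the level-1 check boils down to comparing the new chunk hash against whatever the merged manifest recorded for the same address:

```python
# Minimal sketch of the level-1 dedup check; names are illustrative,
# not wyng's internals.

def needs_send(address, new_hash, fullmanifest):
    """Return True if the chunk at this address changed since the last backup.

    fullmanifest: dict mapping chunk address -> hash recorded by prior sessions.
    """
    return fullmanifest.get(address) != new_hash
```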

This 'limited' dedup makes a noticeable difference for normal day-to-day backups of unique volumes, because wyng's chunk size is often smaller than LVM's chunk size and, without this check, adjacent chunks would get re-backed-up unnecessarily in many cases.

Anything beyond level 1 amounts to a global search-by-hash matching of chunks, with the different levels being different matching strategies that I'm trying out (5 definitely seems the most efficient and a decent memory-vs-speed balance). The global nature means matches can be made across all volumes in an archive, across all session times.

I'd guess you are seeing big savings with the global modes (used by arch-deduplicate) because you have some large volumes that are duplicated, or those volumes have some large files in common.

Here is where level 5 dedup is initialized. It reads in all the hashes for all the chunks in the archive and stores them as integers in a "tree" of bytearrays and arrays (this was the best I could do with native Python structures without making severe demands on RAM). If it's run as arch-deduplicate then 'listfile' is opened and the paths of any matching chunks are dumped to this file, which is later fed to the 'dest_helper.py' program in dedup mode. 'dest_helper.py' then uses the list to hardlink the duplicates together, which saves disk space.
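As a rough sketch of that flow (the real code keeps the hashes as integers in a tree of bytearrays to limit RAM; a plain dict is used here only to show the arch-deduplicate -> listfile -> hardlink steps, and all names are illustrative):

```python
import os

def build_index_and_listfile(chunk_paths, read_hash, listfile):
    """Scan every chunk in the archive; for each hash already seen,
    write 'duplicate_path original_path' to the listfile."""
    index = {}                       # hash -> path of the first chunk with that hash
    with open(listfile, "w") as lf:
        for path in chunk_paths:
            h = read_hash(path)      # chunk hash as recorded in the manifests
            if h in index:
                lf.write(f"{path} {index[h]}\n")
            else:
                index[h] = path
    return index

def hardlink_duplicates(listfile):
    """Replace each duplicate chunk file with a hardlink to its original,
    roughly what the helper's dedup mode does with the list."""
    with open(listfile) as lf:
        for line in lf:
            dup, orig = line.split()
            os.remove(dup)
            os.link(orig, dup)       # both paths now share a single inode
```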

BTW you can also use levels 3-5 dedup with send which can make adding cloned volumes to the archive particularly efficient when net bandwidth is low. In this mode, no listfile is created but the hash trees are retained in memory and then send retrieves and uses them to find matching chunks on-the-fly.
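In rough pseudocode (again only a sketch with hypothetical names, not wyng's API), the send-time variant consults the in-memory index instead of writing a listfile:

```python
def send_chunk(address, data, index, hash_fn, transmit, record_link):
    """Sketch of on-the-fly matching during send: if the chunk's hash is
    already in the index, record a reference to the existing archived chunk
    instead of transmitting the data again."""
    h = hash_fn(data)
    existing = index.get(h)
    if existing is not None:
        record_link(address, existing)   # reuse the chunk already in the archive
    else:
        transmit(address, data)          # new data must actually be sent
        index[h] = address               # later chunks can match against it
```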

Unfortunately, the variable naming is very abbreviated and comments are sparse for dedup. It's my intention to clean that up when dedup leaves experimental status.

tlaurion commented 4 years ago

@tasket : I may have missed it, but where do I integrate that into the configuration at arch-init so that mode 5 is used for every backup?

tasket commented 4 years ago

There's currently no config for it. The option has to be specified on the command line, and there is no "commitment" in terms of archive state... you are free to use it sometimes and not others.

send dedup slows down as the archive gets larger, so keep in mind it's not for every use case. The best fit is when you have just added a cloned volume to the archive.

tlaurion commented 4 years ago

For others, a trace of a working setup where this is really useful, mixing non-default parameters:

sudo wyng -u --meta-dir=/var/lib/2TB-wyng-backups send --testing-dedup 5

Explanation for the community:

On my development station I have a lot of Heads branches, each consuming 8GB of disk space, with only small differential changes across the different boards compiled; more generally, I have many cloned TemplateVM and AppVM LVMs containing a lot of duplicate chunks. Dedup is really useful here to keep what gets backed up minimal. It lets me keep only the AppVMs needed for whatever concept I'm playing with, live without recompiling everything outside of work sessions, and make sure everything is backed up after shutting down the AppVMs once I've locally tested the PRs that were pushed.

This send (backup) amounts to nearly no-cost incremental backups of all machines, and if done frequently it is not even a long process in an hourly crontab. With 67 LVMs known to wyng-backup, a send took real 2m31.852s, assured all my safeguards were in place, and the last hourly incremental came to less than ~250MB across all the VMs I had changed.

#18 is still a big missing feature to make all this go smoothly: every time a restore is done, the AppVM needs to be created manually in QubesOS before receiving directly into the associated LVMs. Having a direct receive possibility, restoring the Qubes AppVM and TemplateVM XML parts so one can just restore and play, would be somewhat magical.

Additional tickets to think about:

@marmarek @tasket

I'm still convinced that even a QubesOS live CD would benefit from this, deploying TemplateVMs as a replacement for rpm (which does not respect the normal integrity contract). A wyng-backup local --meta-dir could be set up by the installer to point to a public QubesOS ssh server, offering users emergency restore of deployed TemplateVMs when needed, plus direct upgrade options for TemplateVMs, requiring only that the user export their added repositories and list of installed packages to redeploy.