Open tlaurion opened 4 years ago
Hi @tlaurion
The default is "dedup level 1", which means `wyng send` checks whether the updated chunk's hash is the same for that particular address as it was during the last backup (that's why `send` now merges manifests into a 'fullmanifest' of the volume before sending, unless the dedup level has been set to 0).
This 'limited' dedup makes a noticeable difference for normal day-to-day backups of unique volumes, bc wyng's chunk size is often smaller than LVM's chunk size and adjacent chunks would get re-backed-up unnecessarily in many cases without this check.
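To make the same-address check concrete, here is a minimal sketch. It is not wyng's actual code: the function names, the tiny chunk size, and the use of SHA-256 are all assumptions for illustration. It only shows the idea of skipping a flagged chunk when its hash matches the hash recorded for the same address in the last backup.

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunk size for illustration only (assumption)


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Map each chunk address (byte offset) to the hex digest of its contents."""
    return {
        addr: hashlib.sha256(data[addr:addr + chunk_size]).hexdigest()
        for addr in range(0, len(data), chunk_size)
    }


def chunks_to_send(flagged_addrs, new_data, last_manifest, chunk_size=CHUNK_SIZE):
    """Return only the flagged addresses whose hash differs from the last backup."""
    to_send = []
    for addr in flagged_addrs:
        digest = hashlib.sha256(new_data[addr:addr + chunk_size]).hexdigest()
        if last_manifest.get(addr) != digest:
            to_send.append(addr)
    return to_send


old = b"AAAABBBBCCCCDDDD"
new = b"AAAABBBBXXXXDDDD"
manifest = chunk_hashes(old)  # hypothetical stand-in for the merged 'fullmanifest'

# Suppose coarser change tracking flagged three chunks as "changed",
# although only the chunk at address 8 really differs.
flagged = [4, 8, 12]
print(chunks_to_send(flagged, new, manifest))  # [8]
```

Only the chunk at address 8 is re-sent; the two adjacent chunks were flagged but are unchanged, which is the case the level 1 check is meant to catch.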
Anything beyond level 1 amounts to a global search-by-hash matching of chunks, with the different levels being different matching strategies that I'm trying out (5 definitely seems the most efficient and a decent memory-vs-speed balance). The global nature means matches can be made across all volumes in an archive, across all session times.
I'd guess you are seeing big savings with the global modes (used by `arch-deduplicate`) bc you have some large volumes that are duplicated, or those volumes have some large files in common.
Here is where level 5 dedup is initialized. It reads in all the hashes for all the chunks in the archive and stores them as integers in a "tree" of bytearrays and arrays (this was the best I could do w native Python structures w/o making severe demands on RAM). If it's run as `arch-deduplicate`, then 'listfile' is opened and the paths of any matching chunks are dumped to this file, which is later fed to the 'dest_helper.py' program in dedup mode. Then 'dest_helper.py' uses the list to hardlink the duplicates together, which saves disk space.
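The two stages described above (find matching chunks by hash, then hardlink them) can be sketched roughly as follows. This is an illustration under stated assumptions, not the real implementation: it uses a plain dict instead of the bytearray "tree", hashes the files directly with SHA-256 instead of reading hashes from manifests, and the function names and file names are hypothetical.

```python
import hashlib
import os
import tempfile


def find_duplicates(chunk_paths):
    """Return (duplicate_path, original_path) pairs for chunks with identical contents."""
    seen = {}   # hash stored as an integer -> first path seen with that hash
    dupes = []
    for path in chunk_paths:
        with open(path, "rb") as f:
            key = int.from_bytes(hashlib.sha256(f.read()).digest(), "big")
        if key in seen:
            dupes.append((path, seen[key]))
        else:
            seen[key] = path
    return dupes


def hardlink_duplicates(dupes):
    """Replace each duplicate file with a hard link to its original."""
    for dup, orig in dupes:
        tmp = dup + ".tmp-link"
        os.link(orig, tmp)     # create a second name for the original's data
        os.replace(tmp, dup)   # atomically swap it in place of the duplicate


with tempfile.TemporaryDirectory() as d:
    paths = []
    for name, data in [("vol1_x0", b"same"), ("vol2_x0", b"same"), ("vol2_x1", b"other")]:
        p = os.path.join(d, name)
        with open(p, "wb") as f:
            f.write(data)
        paths.append(p)
    dupes = find_duplicates(paths)   # plays the role of the 'listfile' contents
    hardlink_duplicates(dupes)
    linked = os.path.samefile(paths[0], paths[1])
    print(len(dupes), linked)  # 1 True
```

After hardlinking, both names point at the same data on disk, so the duplicate no longer consumes extra space.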
BTW you can also use levels 3-5 dedup with `send`, which can make adding cloned volumes to the archive particularly efficient when net bandwidth is low. In this mode, no listfile is created, but the hash trees are retained in memory and `send` then retrieves and uses them to find matching chunks on-the-fly.
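As a rough illustration of why this helps with cloned volumes, the toy model below keeps a global hash index in memory and consults it for each chunk during a send. The `Archive` class, its fields, the chunk size, and the SHA-256 hashing are all assumptions made for the sketch; wyng's real data structures differ.

```python
import hashlib


class Archive:
    """Toy in-memory archive: stored chunks plus a global hash index."""

    def __init__(self):
        self.chunks = {}      # (volume, address) -> bytes, or a reference to another key
        self.index = {}       # chunk hash -> (volume, address) of the first stored copy
        self.bytes_sent = 0   # how much data actually had to be transferred

    def send(self, volume, data, chunk_size=4):
        for addr in range(0, len(data), chunk_size):
            chunk = data[addr:addr + chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.index:
                # Match found anywhere in the archive: record a reference, transfer nothing.
                self.chunks[(volume, addr)] = self.index[digest]
            else:
                self.chunks[(volume, addr)] = chunk
                self.index[digest] = (volume, addr)
                self.bytes_sent += len(chunk)


arch = Archive()
arch.send("template", b"AAAABBBBCCCC")
arch.send("clone", b"AAAABBBBDDDD")  # differs from the template only in its last chunk
print(arch.bytes_sent)  # 16
```

The clone costs only 4 bytes of transfer instead of 12, because two of its three chunks already exist in the archive under another volume.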
Unfortunately, the variable naming is very abbreviated and comments are sparse for dedup. It's my intention to clean that up when dedup leaves experimental status.
@tasket: I may have missed where to integrate that into the configuration at arch-init so that mode 5 is used for every backup?
There's currently no config for it. The option has to be specified on the command line, and there is no "commitment" in terms of archive state... you are free to use it sometimes and not others.
`send` dedup slows down as the archive gets larger, so keep in mind it's not for every use case. The best fit is when you have just added a cloned volume to the archive.
Leaving a trace here, for others, of a working setup where this is really useful, mixing non-default parameters:
sudo wyng -u --meta-dir=/var/lib/2TB-wyng-backups send --testing-dedup 5
Explanation to community:
`--meta-dir` statement (`arch-init`, `add`): here, backing up externally to a LUKS container mounted locally on the laptop, with the current config in the specified `--meta-dir`,
if the backup destination is ready (mounted; available) in the configured crontabs.

On my development station, I have a lot of Heads branches, each consuming 8GB of disk space, with only small differential changes across the different boards compiled. More generally, I have a lot of cloned TemplateVM- and AppVM-related LVMs containing a lot of duplicate chunks. This is really useful to make sure that what is backed up is minimal. It lets me keep only the AppVMs required for the concept I'm playing with at a given time, live, without needing to recompile everything outside of work sessions, and it makes sure everything is backed up after shutting down AppVMs once I've locally tested PRs that have been pushed.
This `send` (backup) makes incremental backups of all machines nearly free and, if done frequently, not even a long process in hourly crontabs (67 LVMs known to wyng-backups: a `send` took `real 2m31.852s`), assuring all my safeguards were in place, with the last incremental coming to less than ~250Mb for the last hour across all the VMs I made changes on.
Additional tickets to think about:
@marmarek @tasket
I'm still convinced that even the QubesOS live CD would benefit from this: deploying TemplateVMs this way as a replacement for rpm, which does not respect the normal integrity contract. A local wyng-backups --meta-dir could be set up by the installer to point to a public QubesOS ssh server, offering users emergency restore of deployed TemplateVMs when needed, as well as direct upgrade options for TemplateVMs, requiring the user only to export added repositories and the list of installed packages to redeploy.
Hey @tasket
I was wondering what is actually done out of the box now for deduplication when sending, since applying arch-dedup frees a lot of space. Could you elaborate a bit on that and point me to the code?
Thanks!