tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0

Performance hits of user choices #81

Open tlaurion opened 3 years ago

tlaurion commented 3 years ago

Sorry if I am a bit rigid on documentation. I have a hard time wrapping my head around the current behaviour: QubesOS is known to add high overhead when the backup archive lives on an AppVM's LVM storage, so I thought of using --sparse-write to spread the CPU-pinning cost over 2 CPUs, giving it an edge over --sparse, but I don't see a direct benefit.

Let me explain:

Doc today says:

--sparse | Receive volume data sparsely (implies --sparse-write)
--sparse-write | Overwrite local data only where it differs (receive)

Where detailed doc says:

--sparse-write

Used with receive, this option does not prevent Wyng from overwriting existing local volumes! The sparse-write mode merely tells Wyng not to create a brand-new local volume for receive, and results in the data being sparsely written into the volume instead. This is useful if the existing local volume is a clone/snapshot of another volume and you wish to save local disk space. It is also best used when the backup/archive storage is local (i.e. fast USB drive or similar) and you don't want the added CPU usage of full --sparse mode.

--sparse

The sparse mode can be used with the receive command to intelligently overwrite an existing local volume so that only the differences between the local and archived volumes will be fetched from the archive and written to the local volume. This results in reduced remote disk and network usage while receiving at the expense of some extra CPU usage on the local machine, and also uses less local disk space when snapshots are a factor (implies --sparse-write).

My understanding is that the present code does not parallelize any work (in either mode; I think that's another ticket), so single-core performance would be the limit for the combined dom0 + qube virtualized IO of the backup storage in my use case (which happens over wyng-backups-vm storage).

I would still have expected --sparse-write (50% of the load in the storage qube, 50% in dom0) to speed up the receive operation over --sparse (100% CPU hit in dom0 for the local calculation, but less pulling of the qube's stored backup data), yet the results seem to be about equal.

Maybe you could clarify or give a bit more insight? Otherwise I will put timestamps in my scripts.

Pertinent notes on current archive.ini conf:

chunksize = 262144
compression = bz2
compr_level = 9
hashtype = sha256

bz2 was chosen for Heads' current busybox support, and I lost track of the chunksize and hashtype costs, so shed some light on those if you will! :)

Also note that https://git.busybox.net/buildroot/commit/?id=6bccac75ea3f8cd66bcde3747067add14b0c4f2c relies on a Python script... so that's not going to happen soon under Heads.

tasket commented 3 years ago

There is already an issue for generic CPU optimization. But you should know that receive has gotten very little of it so far. Probably after v0.5 we'll see some receive CPU optimization including parallel processing.

(BTW, I was able to make the wyng-extract.sh script somewhat parallel for compression because the data caching in /tmp presented the opportunity to do this easily. But Wyng itself doesn't cache data this way.)

The doc for --sparse-write doesn't mention that it introduces an extra step: comparison with the local volume data. Despite that, all chunks are retrieved from remote. Therefore, this option should be used specifically to optimize local disk space.

Re-Compression dominates receive --sparse out of necessity & bzip2 is CPU intensive. In the future we could use more efficient compression like zstd and lbzip2, as well as find ways to do more work in parallel. But for now this option is strictly "use CPU bandwidth to avoid costs on slow/expensive network".

And the respective efficiency of both options depends on just how much the archive copy differs from the local volume being overwritten. (More difference = less efficiency.)
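To make the comparison step concrete, it looks roughly like this (an illustrative sketch only, not Wyng's actual code; the chunk size, manifest layout and fetch_chunk() helper call are placeholders):

import bz2, hashlib

CHUNK = 262144   # matches the archive.ini chunksize shown above

def sparse_receive(volume_path, manifest, fetch_chunk):
    # manifest: {offset: hash of the compressed chunk stored in the archive}
    # fetch_chunk(offset): returns that compressed chunk from the helper
    with open(volume_path, "r+b") as vol:
        for offset, archived_hash in sorted(manifest.items()):
            vol.seek(offset)
            local = vol.read(CHUNK)
            # The expensive step: re-compress the local data exactly as the
            # archive did (bz2 level 9 here) so the hashes are comparable at all.
            if hashlib.sha256(bz2.compress(local, 9)).hexdigest() != archived_hash:
                vol.seek(offset)
                vol.write(bz2.decompress(fetch_chunk(offset)))   # fetch and overwrite only what differs

Note the comparison only works if the local re-compression is byte-for-byte reproducible, which is why the compression-format stability discussed below matters.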

This type of issue is why backup tools like Wyng try to migrate to efficient compression libraries as soon as they can, bc uncompressed data chunks cannot be safely compared for the --sparse use case. v0.3 now has Zstandard, which is fast but requires a fairly recent OS version (Qubes 4.0 has lousy support for zstd). The picture for Wyng is quite complex bc libraries for new tools/formats come to Python slowly, and the decisions of repo managers in this respect are often shockingly bad; there is also a format stabilization issue w zstd that can defeat a sparse processing feature (IIRC this has also negatively impacted other backup tools) but that is slowly improving.


Hashtype is already the fastest type, sha256. Chunk size IIRC was chosen to reduce the amount of metadata being sent over a slow network and reduce metadata that had to be verified in the Heads env. You might try an archive configured with smaller chunk sizes (the default is 65536) to see how that impacts send/receive ops.
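If you want to sanity-check the hash cost on your own hardware, Python's hashlib makes for a quick test (a rough sketch; the 64MiB buffer is arbitrary and results depend on hardware acceleration):

import hashlib, os, time

data = os.urandom(64 * 1024 * 1024)   # 64 MiB of random data
for algo in ("sha256", "blake2b"):
    t = time.perf_counter()
    hashlib.new(algo, data).hexdigest()
    print(algo, round(len(data) / (time.perf_counter() - t) / 2**20), "MiB/s")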

tlaurion commented 3 years ago

I did a comparison to measure the virtualization and additional IO costs for --sparse vs --sparse-write. The difference between locally mounted LVM (dom0) and QubesOS AppVM backup storage was not significant. Output below:

chunksize = 262144
compression = bz2
compr_level = 9
hashtype = sha256

Windows-standalone-root was chosen because it is the biggest LVM I had on hand, weighing 16GB of backed-up data as compressed size on the backup storage (26465MiB on thin LVM as reported by the QubesOS manager).

AppVM (QubesOS mode):


[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-vm -u receive vm-windows-10-standalone-root --sparse-write --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    28m30.861s
user    25m13.284s
sys 2m0.635s

[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-vm -u receive vm-windows-10-standalone-root --sparse --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    62m10.596s
user    52m39.610s
sys 7m57.865s

[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-vm -u receive vm-windows-10-standalone-root --sparse-write --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    28m24.503s
user    25m1.602s
sys 2m0.908s

[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-vm -u receive vm-windows-10-standalone-root --sparse --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    62m16.607s
user    52m43.738s
sys 7m59.969s

Same archive, but on a locally mounted LVM in dom0, so now without the IO + virtualization overhead:


[user@dom0 ~]$ sudo wyng -u --meta-dir=/var/lib/wyng-backups-local-mount --from=internal:/ --subdir=/media/home/user/wyng-backups arch-init

[user@dom0 ~]$ sudo wyng -u --meta-dir=/var/lib/wyng-backups-local-mount --from=internal:/ --subdir=media/home/user/wyng-backups arch-init
Wyng 0.3.0rc2 20210622
[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-local-mount -u receive vm-windows-10-standalone-root --sparse --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    61m52.233s
user    52m23.387s
sys 7m58.963s
[user@dom0 ~]$ time sudo wyng --meta-dir=/var/lib/wyng-backups-local-mount -u receive vm-windows-10-standalone-root --sparse-write --verbose
Wyng 0.3.0rc2 20210622
Checking metadata... OK
Receiving volume : vm-windows-10-standalone-root 20210629-154433
Saving to logical volume '/dev/qubes_dom0/vm-windows-10-standalone-root'
100.00%
  Initial snapshot created for vm-windows-10-standalone-root

real    28m3.802s
user    25m4.396s
sys 1m59.709s
tasket commented 3 years ago

There is another reason why --sparse can be slower: Without sparse the list of chunks to be sent is pre-fetched by the helper program, but with sparse it must wait for the local system to compress+compare before receiving the next chunk identifier. So that introduces latency.

An idea for the future would be for receive to look for Wyng snapshots that belong to a known session in the archive, make a local comparison between the snapshot and the dest volume, and then compute the receive manifest based on that comparison + any session manifests that come after the snapshot. Re-compression would not be needed at all in such a case.
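In rough pseudo-Python, the receive manifest would then just be a union of change sets (an illustrative sketch of the idea, not existing code; all names are placeholders):

def receive_manifest(local_diff_offsets, later_session_manifests):
    # local_diff_offsets: chunk offsets where the Wyng snapshot differs from the dest volume
    # later_session_manifests: per-session sets of offsets changed in the archive after that snapshot
    need = set(local_diff_offsets)
    for session in later_session_manifests:
        need |= session
    return sorted(need)   # only these chunks get fetched; no re-compression needed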

tlaurion commented 3 years ago

From https://github.com/tasket/wyng-backup/issues/83#issuecomment-873607348

Yes sshfs along with the qubes: (not qubes-ssh:) specifier can get you around these hurdles. I think sshfs can be pretty slow when used with the kind of LUKS+Ext4 loopback container (in dom0) suggested in the Readme, and this is how I use it. So I'm sorry to hear its also slow without the container layer. I noticed there is a lot of sshfs tuning advice out there; a 100% speed improvement seems pretty good.

Some clarifications. The 100% speed improvement came from using --sparse-write instead of --sparse in qubes:// mode; the separate comparison, between the archive stored inside qubes:// data storage and on a dom0-mounted LVM, did not show significant gains.

That has nothing to do with my current tests over an sshfs-mounted, LUKS-mapped container backed by a loopback raw file on sshfs. In that scenario, sync seems to be the culprit for the delays when applying dedup (per lsof | grep media, /media being the final mountpoint of the mapped, unlocked LUKS container's partition); those test results are preliminary.

Those results contrast with testing qubes:// mode over a plain sshfs-mounted remote directory. In that scenario, sshfs plus a loopback-mounted, unlocked LUKS container (LUKS+ext4) is faster than a plain sshfs mount of the remote directory: on the plain mount, operations on files and directories are a lot slower and listing content, du, etc. take forever, while the same operations on the mounted loopback (sync aside) are instantaneous.

EDIT: Will verify the results of the sshfs tuning advice (to see if that is the expected 100% improvement here).

tlaurion commented 2 years ago

@tasket some results on fresh Q4.1 install (and why I posted 3 bug reports)

At the time of writing (2022-04-14): Wyng 0.4alpha 20220104

Qubes 4.1 clean install backup. dom0: sudo qubes-dom0-update python3-zstd

root-autosnap is created at shutdown by a systemd shutdown hook, /usr/lib/systemd/system-shutdown/root-autosnap.shutdown:

/usr/sbin/lvremove --noudevsync --force -An qubes_dom0/root-autosnap
/usr/sbin/lvcreate --noudevsync --ignoremonitoring -An -pr -s qubes_dom0/root -n root-autosnap

Interestingly enough, specifying vm-pool at arch-init still permits backing up root-autosnap from wyng.

Basically, for the next tests, we vary arch-init settings:

sudo wyng --local=qubes_dom0/vm-pool --dest=qubes://wyng-backups/ --subdir=home/user/ arch-init
sudo wyng --local=qubes_dom0/vm-pool --dest=qubes://wyng-backups/ --subdir=home/user/ arch-init --compression zlib:3

or sudo wyng --local=qubes_dom0/vm-pool --dest=qubes://wyng-backups/ --subdir=home/user/ arch-init --compression zlib:9

Then:

sudo wyng add vm-debian-11-root vm-fedora-34-root vm-whonix-gw-16-root vm-whonix-ws-16-root root-autosnap vm-anon-whonix-private vm-default-mgmt-dvm-private vm-fedora-34-dvm-private vm-personal-private vm-sys-whonix-private vm-untrusted-private vm-vault-private vm-whonix-ws-16-dvm-private vm-work-private

Then: time sudo wyng send or time sudo wyng send --dedup

Then in-between tests: sudo wyng arch-delete

Most of the CPU work happens in dom0, while wyng-backups seems to be waiting on IO.


Unknowns: cost of encryption (cannot test --encrypt=off on "Wyng 0.4.0alpha release 20220104"; bugs reported individually).

Knowns: x230: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz Qubes 4.1 from release ISO.

Default settings (zlib:3 compression), no --dedup:

Sending backup session 20220414-160855 to qubes://wyng-backups
  100%    1357.2M  |  root-autosnap 
  100%       0.0M  |  vm-anon-whonix-private 
  100%    1517.4M  |  vm-debian-11-root 
  100%       0.0M  |  vm-default-mgmt-dvm-private 
  100%       0.0M  |  vm-fedora-34-dvm-private 
  100%    2255.5M  |  vm-fedora-34-root 
  100%       0.0M  |  vm-personal-private 
  100%       0.0M  |  vm-sys-whonix-private 
  100%       0.0M  |  vm-untrusted-private 
  100%       0.0M  |  vm-vault-private 
  100%     715.0M  |  vm-whonix-gw-16-root 
  100%       0.0M  |  vm-whonix-ws-16-dvm-private 
  100%    1140.6M  |  vm-whonix-ws-16-root 
  100%       0.0M  |  vm-work-private

real    7m44.000s
user    4m39.364s
sys 1m46.112s
7.0GB on disk

Initial send with --dedup:

Sending backup session 20220414-163805 to qubes://wyng-backups
  100%    1346.5M  |  root-autosnap 
  100%       0.0M  |  vm-anon-whonix-private 
  100%    1399.6M  |  vm-debian-11-root 
  100%       0.0M  |  vm-default-mgmt-dvm-private 
  100%       0.0M  |  vm-fedora-34-dvm-private 
  100%    2158.6M  |  vm-fedora-34-root 
  100%       0.0M  |  vm-personal-private 
  100%       0.0M  |  vm-sys-whonix-private 
  100%       0.0M  |  vm-untrusted-private 
  100%       0.0M  |  vm-vault-private 
  100%     279.4M  |  vm-whonix-gw-16-root 
  100%       0.0M  |  vm-whonix-ws-16-dvm-private 
  100%     487.4M  |  vm-whonix-ws-16-root 
  100%       0.0M  |  vm-work-private 

real    7m31.166s
user    4m43.038s
sys 1m39.586s
5.7GB on disk

arch-init with zlib:5 and --dedup:

Sending backup session 20220414-170001 to qubes://wyng-backups
  100%    1321.7M  |  root-autosnap 
  100%       0.0M  |  vm-anon-whonix-private 
  100%    1361.9M  |  vm-debian-11-root 
  100%       0.0M  |  vm-default-mgmt-dvm-private 
  100%       0.0M  |  vm-fedora-34-dvm-private 
  100%    2119.8M  |  vm-fedora-34-root 
  100%       0.0M  |  vm-personal-private 
  100%       0.0M  |  vm-sys-whonix-private 
  100%       0.0M  |  vm-untrusted-private 
  100%       0.0M  |  vm-vault-private 
  100%     276.9M  |  vm-whonix-gw-16-root 
  100%       0.0M  |  vm-whonix-ws-16-dvm-private 
  100%     476.2M  |  vm-whonix-ws-16-root 
  100%       0.0M  |  vm-work-private 

real    13m46.003s
user    11m17.495s
sys 1m43.612s
5.6GB on disk

arch-init with zlib:9 and --dedup:

Sending backup session 20220414-174526 to qubes://wyng-backups
  100%    1308.8M  |  root-autosnap 
  100%       0.0M  |  vm-anon-whonix-private 
  100%    1347.8M  |  vm-debian-11-root 
  100%       0.0M  |  vm-default-mgmt-dvm-private 
  100%       0.0M  |  vm-fedora-34-dvm-private 
  100%    2105.8M  |  vm-fedora-34-root 
  100%       0.0M  |  vm-personal-private 
  100%       0.0M  |  vm-sys-whonix-private 
  100%       0.0M  |  vm-untrusted-private 
  100%       0.0M  |  vm-vault-private 
  100%     273.5M  |  vm-whonix-gw-16-root 
  100%       0.0M  |  vm-whonix-ws-16-dvm-private 
  100%     471.0M  |  vm-whonix-ws-16-root 
  100%       0.0M  |  vm-work-private 

real    53m22.395s
user    50m45.852s
sys 1m47.103s
5.5GB on disk

Considering those results, there is no real gain in compressing past zlib:3, while --dedup gives a lot, even on the first send.

tlaurion commented 2 years ago

@tasket !!!! Finally found a cheap provider to experiment with.

veeble.org: $5 USD a month, 2GB RAM, 20GB SSD and 100TB bandwidth. They of course have more space/bandwidth/memory options available if needed, and a DNS name as well for a later, more serious PoC.

I was able to duplicate the rsync.net sub-account setup using basic user rights: create an rw account with an ro sub-account (in a subdir) whose name specifies what OEM image type is there (q41_insurgo here as an example), with the ssh authorized_keys file simply put somewhere else via an sshd_config override on user match:

Match User q41_insurgo
        AuthorizedKeysFile /etc/ssh/authorized_keys-%u

So safe state restoration as a service is totally feasible on cheap, storage-friendly VPS services (again, no 0.4 encryption testing yet, but I see no showstopper there; please fix #112 though!).

@tasket: do you have an x230 laptop? This might go faster now. I see you are not active on Matrix? As stated under #104, blake2 and zstd were packaged under Heads successfully, and LVM thin provisioning was hacked to bypass the past failings. Basically, the only thing missing is an ash-compatible version of the wyng bash script, which I was not able to make work before; with that, I could even have Heads create the LUKS container and VG pools and just pump a state on demand soon enough.

Basically, the RW account is used by the OEM/org to create the archive under the RO account, which is made available for accessing the backup archives on the condition of having one's public key in the authorized_keys file above.

The RO account is used from dom0, through the qubes-ssh specified app-qube, to retrieve trusted state archives; it works pretty well as opposed to sshfs (now deprecated anyway...).

We will have a problem offering states as a service though.

As of now, I see that the wyng helper script and its error files sit at a shared location; if multiple people were using the service at the same time, those should be isolated with different paths on the ssh server host. Want me to create an individual issue?


Some comparison of the performance differences of the current modes, with wyng-backup's arch-init defaults:

With and without --dedup, using --sparse-write:

Screenshots: 2022-08-28-170108, 2022-08-28-170904, 2022-08-28-171817. We see a small difference in local processing time, while the bandwidth used is mostly the same.

With and without --dedup, using --sparse:

We see that the bandwidth consumed is strongly reduced, but CPU usage and processing time increase dramatically. This setting would be perfect for a low-bandwidth situation where the user can lock the computer and go to sleep while this happens.

Screenshot: 2022-08-28-185340

Consequently, I think the documentation should put more emphasis on the difference between sparse and sparse-write. It has a high impact, and I wish I could trace what accounts for the difference a bit more. dom0 was using 10-20% CPU the whole time, so it is not busy enough to account for the difference in processing time. The load average on the server is 0.00 0.01 0.05, so if something could be done by the server to help the client speed things up (through the helper), that might be a nice avenue here. Given the earlier results there seems to be no real reason to use all that bandwidth; something seems to be missing to make detecting and transmitting only the needed changes faster. @tasket: thoughts?

My intuition here is that the client could upload a bit more about its mapping to the server (4MB uploaded vs 361 downloaded here, taking about an hour, while the previous tests completed within minutes on a 50Mbit download link).

tlaurion commented 2 years ago

Conclusion: --dedup with --sparse vs --sparse-write

So having --sparse is:

tasket commented 2 years ago

@tlaurion I don't have an x230 but I do have a T430s which is internally almost identical. It currently has a basic Qubes 4.1 install and factory firmware.

blake2 isn't required for v0.4, you can manually select sha256. zstd might give you a speed boost, but it could also mess things up because the format has been evolving recently so I doubt how reproducible the resulting "comparison chunks" will be (probably an issue bc Python library and script library won't be identical). So bzip2 is still the safe bet. FWIW, I could now add gzip support to Wyng because newer Python gzip lib allows override of time header info which is required for consistent hashing.
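For illustration, here is what that mtime override buys (a minimal sketch assuming Python 3.8+, where gzip.compress accepts an mtime= argument; not Wyng code):

import gzip, hashlib

chunk = b"example chunk data" * 1000
a = gzip.compress(chunk, compresslevel=9, mtime=0)   # pin the header timestamp
b = gzip.compress(chunk, compresslevel=9, mtime=0)
assert hashlib.sha256(a).digest() == hashlib.sha256(b).digest()   # deterministic output, so hashes are comparable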

As of now, I see that the wyng helper script and its error files sit at a shared location; if multiple people were using the service at the same time, those should be isolated with different paths on the ssh server host. Want me to create an individual issue?

I think this is due to /tmp dir paths being static. I am already addressing this in v0.4 but if you need it working in v0.3 then open an issue.

The benchmark is interesting. I would not have expected 80m added with --sparse. CPU wasn't hugely affected so I think this has to do with the much higher interactivity over the network; worth investigating and improving. I did try to avert this type of issue as the current code already issues a flush op when requesting a chunk:

            else:
                print("%s/%s/%s" % (ses, faddr[1:addrsplit], faddr), flush=True, file=gv_stdin)

Maybe Python isn't pushing the flush past its various io layers, or it may be an ssh/Internet buffering behavior. But yeah, I interpret this as mostly latency/waiting occurring when it shouldn't. Obviously sparse receive could be very valuable if this were resolved so I'll definitely try to do so.

Also note --sparse-write only affects local writes; the only noticeable difference would be less space used in the LVM pool. As such, it is currently imperfectly implemented because occupied chunks which are zeroed-out by receive only generate a 'discard' if LVM is configured to automatically discard zeros; ideally Wyng should generate the discard but that is not simple to do in Python. Finally, --dedup has no effect on receive although it's interesting you make that association; dedup is for send only (now automatically activated for arch-deduplicate).

My intuition here is that the client could upload a bit more about its mapping to the server (4MB uploaded vs 361 downloaded here, taking about an hour, while the previous tests completed within minutes on a 50Mbit download link).

Yes the procedural difference between sparse and non-sparse is that the latter sends an entire file list to the helper script in one batch, while sparse mode compares-then-requests each chunk individually. Doing it the current way actually presents opportunity for reduced (not enlarged) processing time but specific i/o behaviors may make it necessary to use asyncio to realize that potential. And yes, comparing all then sending the list to the helper would immediately improve performance, but that seems like the low road to me; we want CPU comparing and net i/o flowing simultaneously if possible.
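A toy sketch of that "comparing and fetching at the same time" shape with asyncio (illustrative only, not Wyng code; compare_chunk/fetch_chunk are stand-ins for the real compress+compare step and the helper request):

import asyncio, random

async def compare_chunk(off):             # stand-in for the compress+compare step
    await asyncio.sleep(0.001)            # pretend CPU work (real code would use an executor)
    return random.random() < 0.3          # pretend ~30% of chunks differ

async def fetch_chunk(off):               # stand-in for asking the helper for a chunk
    await asyncio.sleep(0.01)             # pretend network latency

async def producer(offsets, queue):
    for off in offsets:
        if await compare_chunk(off):
            await queue.put(off)          # chunk differs: schedule a fetch
    await queue.put(None)                 # sentinel: comparing is done

async def consumer(queue):
    while (off := await queue.get()) is not None:
        await fetch_chunk(off)            # network i/o overlaps with ongoing compares

async def main():
    queue = asyncio.Queue(maxsize=64)     # bounded so compares don't run far ahead of fetches
    await asyncio.gather(producer(range(1000), queue), consumer(queue))

asyncio.run(main())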

tlaurion commented 2 years ago

zstd might give you a speed boost, but it could also mess things up because the format has been evolving recently so I doubt how reproducible the resulting "comparison chunks" will be (probably an issue bc Python library and script library won't be identical). So bzip2 is still the safe bet. FWIW, I could now add gzip support to Wyng because newer Python gzip lib allows override of time header info which is required for consistent hashing.

As per the (imperfect) PR proposed, I was able to integrate blake2 and zstd under Heads, and to remove the thin-provisioning-tools checks.

zstd and blake2 are definitely speedier, so a bit more detail on zstd not having consistent hashing would be welcome here for the next steps of testing.

bzip2 is damn slow!

tasket commented 2 years ago

Yes, my initial tests of zstd files from different sources show they don't match. Under certain conditions they are very close in size, so I will look further with hexdump to see if the difference is just header info.

blake2 isn't really faster than sha256 as the latter usually benefits from hw acceleration. However, blake2 is considered more secure as it has good resistance against length extension attacks.

bzip2 does compare favorably to zstd speed when higher compression ratios are used. If you're OK with lower compression ratios (say 3.0:1 instead of 3.8:1) and compression speed is more important than net bandwidth, then gzip is a future possibility. Currently Wyng v0.3 cannot do gzip because it's geared to Python 3.5.

tasket commented 2 years ago

BTW, considering you are importing new tools into Heads environment, the compression issue IIRC is resolved if the env has pigz available.

tasket commented 2 years ago

BTW2... adding gzip to Wyng only solves the consistency issue internally, for things like deduplication. When using shell script to process data, the gzip command itself has no way to suppress header timestamps. However, pigz has options to suppress timestamps for gzip format. The only alternative to using pigz in this case is to hack the header metadata created by gzip before chunks are compared.
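A sketch of that header hack (per RFC 1952 the 4-byte MTIME field sits at offsets 4-7 of a gzip member; other fields such as the OS byte or an embedded filename may also need normalizing, so treat this as illustrative only):

def normalize_gzip_header(chunk: bytes) -> bytes:
    # Zero the MTIME field so two gzip streams of identical data hash the same.
    if chunk[:2] != b"\x1f\x8b":                     # not a gzip member, leave as-is
        return chunk
    return chunk[:4] + b"\x00\x00\x00\x00" + chunk[8:]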

tasket commented 2 years ago

@tlaurion After doing some manual tests with python-zstd and 'zstd' command line tool, I have some good news...

The output does match if --no-check option is used with the zstd command.

The bad news: This was tested in dom0 / fc32 system where both the python library and the CLI tool use libzstd version 1.4.x. Newer Linux releases have a CLI command version 1.5.x which does not yield matching output with the older library version. So for zstd to work with the Wyng sh script, for the time being you will have to use older zstd v1.4.x in the Heads environment.

tasket commented 2 years ago

zstandard issue explaining the conditions for reproducibility: https://github.com/facebook/zstd/issues/999

tasket commented 2 years ago

@tlaurion wyng-extract.sh has been updated in fix03 to make zstd compression reproducible and generally usable in this context. Compression levels 3-10 will give fast results with good size reduction.

tlaurion commented 2 years ago

@tlaurion wyng-extract.sh has been updated in fix03 to make zstd compression reproducible and generally usable in this context. Compression levels 3-10 will give fast results with good size reduction.

@tasket Will look at it, but as stated in PR https://github.com/tasket/wyng-backup/pull/104 the script contains bashisms that Heads' busybox (ash compliant) doesn't like.

I tried to remove some of those bashisms but broke the script doing so, leaving a trace of what needs to be removed to be more POSIX-ish compliant.

tlaurion commented 2 years ago

@tasket I just commented on the required changes pushed under #104 to remove bashisms; I left the PR as a draft because it is not functional as is. Take literally anything that is needed.

I understand that I have to package zstd 1.4.x under Heads.

tlaurion commented 2 years ago

blake2 isn't really faster than sha256 as the latter usually benefits from hw acceleration. However, blake2 is considered more secure as it has good resistance against length extension attacks.

From the compilation choices, I understood that blake2b is also hardware accelerated.

bzip2 does compare favorably to zstd speed when higher compression ratios are used. If you're OK with lower compression ratios (say 3.0:1 instead of 3.8:1) and compression speed is more important than net bandwidth, then gzip is a future possibility.

I think the priority will be to reduce restoration times, so I guess a combination of higher compression (zstd 3 is the default, right? So should I test zstd 10-19?) and blake2b.

tlaurion commented 2 years ago

The bad news: This was tested in dom0 / fc32 system where both the python library and the CLI tool use libzstd version 1.4.x. Newer Linux releases have a CLI command version 1.5.x which does not yield matching output with the older library version. So for zstd to work with the Wyng sh script, for the time being you will have to use older zstd v1.4.x in the Heads environment.

Will retest this; I'm not clear on the impact of the https://github.com/facebook/zstd/issues/999#issuecomment-359538229 comment in our wyng-backup case.

@tasket It's also confusing to learn that once dom0 is upgraded, the full backup archive will need to be redone? So basically, what I understand from this is that things break if hashes are computed on the resulting compressed data rather than on its original blocks? This might be problematic?

tlaurion commented 2 years ago

@tasket I could also pack pigz instead of zstd and compare results with --sparse restoration.

For the sake of state restoration as a service, a choice will have to be made between archive lifetime and restoration speed over the network, for which I do not yet have enough experimental background.

The --sparse restoration results above came from the fix03 branch with wyng-backup default settings, used to back up to the local wyng qube, with the Python script used to receive the archive.

I only rsync'ed the archive over to the VPS for the network-based restoration tests shown, so any clear recommendations on arch-init settings to test would be welcome, to optimize network bandwidth and restoration time :)

I could also switch to testing the 0.4 branch from now on. I have not followed the improvements on that branch, but if the integrity contract is now built in (with or without encryption, if it can be passed as an unattended option), I could start testing that instead, provided of course the wyng-extract script can be used with it going forward.

Not to mix performance tests with long-term support concerns, but since states are meant to be selectable, I would definitely prefer directions that would not require recreating the archives too often :)

As of now, I'm just getting excited about having a PoC under Heads.

tasket commented 2 years ago

I think the priority will be to reduce restoration times, so I guess a combination of higher compression (zstd 3 is the default, right? So should I test zstd 10-19?) and blake2b.

zstd level 10 will give about the same throughput as gzip/zlib level 4 but with noticeably better compression ratios. Feel free to experiment but I personally wouldn't use above zstd 10; the setting I typically use is either 3 or 7. This benchmark chart gives a general idea of the differences.

Keep in mind that for the wyng-extract.sh script in sparse mode, it must also do compression (in addition to decompression) in order to find/fetch only changed chunks.

When dom0 changes to zstd 1.5 some choices will have to be made. With Wyng-only operation, the "breakage" would manifest as dedup and remap becoming temporarily inefficient, but I would expect no data corruption. In particular, a remap op (where a mismatched snapshot is deleted and a new snapshot is paired) would result in a whole additional copy of the volume being added to the archive (although subsequent remaps of the same volume would not suffer this effect). IIRC the borg backup program standardized on zstd early and has issued many advisories to users to ditch and rebuild their archives after upgrading to avoid archives ballooning in size. For the time being, I will look for ways to advise/warn users, but I may put restrictions on which version can be used (already started this in the sh script).

OTOH, a careful archive user/curator could discern when zstd has changed to 1.5 and then prune all the older sessions that were done with 1.4. I think for your use case w sh script, disk space would be saved but bandwidth for dl updates is not saved.

OTOH2, Ubuntu LTS already has 1.5 of the python3-zstd library, and that version is already in Debian Testing. Fedora lags badly, however, with no update between fc32 and fc37. Maybe consider backporting the 1.5 library to Fedora ourselves.

Hashing: I would use blake2b because the difference vs sha256 may not even be noticeable as they are both far faster than most compression options.

Wyng 0.3 vs upgrading to v0.4alpha: The v0.4 format is going to change some more when alpha3 drops, but I don't anticipate any conversion roadblocks bc unencrypted data chunks will remain the same. There is already alpha1->alpha2 conversion that is done automatically but I don't anticipate v0.3->v0.4 conversion until the end of alpha3. I still prefer to test the extractor sh script on v0.3 and then convert it to v0.4 later mostly bc some tedious steps will have to be added to support v0.4 format.

Verification of v0.4 archives: Think of it being mostly the same as v0.3 except you only need to do your own verification on archive.ini if archive is unencrypted; archive.ini will verify the rest of the metadata and data. If archive is encrypted then archive.ini is self-verifying.

tasket commented 1 year ago

@tlaurion Here is my updated survey of the situation, based on feedback from zstd project and some recent tests I've made...

Assessment

Neither Zlib nor Gzip can match shell command output with Python lib output. This is unfortunate because Zlib output remains very consistent between versions ranging from Fedora 32 through 36 and Python 3.5 through 3.11.

Bzip2 output matches no matter what, across shell, Python and different versions.

Zlib, Gzip and Bzip2 are mature, stable code bases.

Zstd can be very consistent between shell and Python output if the versions are similar. It's an encouraging sign, but the Zstd project is extremely noncommittal on the subject of reproducibility; if they so much as tweak a status message or fix a buffer overflow vuln, we are to assume Zstd output will be different than in the past.

Options

Other

Affects issue #54 –

SSH/Rsync/remote: The extract shell script operates as a file batch processor, so the addition of remote access transfers ought to be straightforward.

Sparse mode: At this point I would make the script blockdev-only, which gets us past the busybox fallocate shortcomings. That makes busybox dd shortcomings the biggest issue; the basic problem here is simply updating a block device in a sparse way to avoid consuming 100% disk space for each volume restore. dd sparse mode made this easy, but it can be done in other ways.
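One alternative, sketched here under the assumption that the target is a block device (this is not the wyng-extract.sh logic, just an illustration in the spirit of --sparse-write): read each block back and write only where the content actually differs.

def sparse_update_blockdev(dev_path, chunks):
    # chunks: iterable of (offset, data) pairs for the volume being restored
    with open(dev_path, "r+b") as dev:
        for offset, data in chunks:
            dev.seek(offset)
            if dev.read(len(data)) != data:   # existing content differs
                dev.seek(offset)
                dev.write(data)               # write only changed blocks, leave the rest untouched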

tlaurion commented 3 months ago

Weird issue with ext4 while attempting to cp -alr archive dir to another one. Seems like there is a maximum number of possible references to the same blocks?

Maybe the documentation should mention filesystem limits. As of now, we know ext4 might not be a perfect fit in terms of its fixed inode count (the maximum number of small files that can be created on an ext4 filesystem, determined at fs creation time), and there is this weird limit I encountered trying to archive an archive by doing a directory copy with hard-link tracking.

@tasket?

tasket commented 3 months ago

The hard-link limit for any single file on most Linux filesystems is about 65,000.

Having any data that is quite that dedup-prone is a very small corner case. Wyng has its internal workaround, which you helped with via your feedback. But externally, no; nothing in GNU or Linux guards against it or works around it.

That Wyng workaround could probably be enhanced so that links are kept to, say, 6500 per file instead of 65,000. But I very much doubt it's a good idea to implement that before the "Cloud storage API" feature.
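The guard itself would be simple enough; a sketch of the idea (the paths and the cap are illustrative, not Wyng's implementation):

import os, shutil

LINK_CAP = 6500   # well below the ~65,000 per-inode hard-link limit on ext4

def dedup_link(existing_path, new_path):
    if os.stat(existing_path).st_nlink < LINK_CAP:
        os.link(existing_path, new_path)       # cheap: one more name for the same inode
    else:
        shutil.copy2(existing_path, new_path)  # start a fresh inode to stay under the limit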

But note... Implementing an internal archive-copying feature could also be the answer.