Closed · kornelski closed this issue 2 years ago
There are two cases:

1. A file is renamed or moved, but its content is unchanged.
2. A file is renamed or moved, and its content is modified as well.

I think case 1 can be supported reasonably easily by extending the delta format with a copy/rename operation. The source files can be found by the hash of their contents.

Case 2 is a bit trickier to implement, because we'd have to find the "most similar" candidate for the diff. Perhaps rsync-like rolling hashes could be used? git also handles this well, so we could see if we can copy its approach.
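To make case 1 concrete, here's a minimal Python sketch of finding moved-but-unchanged files by hashing their contents. The function names (`index_by_content`, `find_exact_renames`) are hypothetical, not anything from the actual delta tool:

```python
import hashlib
import os

def index_by_content(root):
    """Map SHA-256 of each file's contents -> list of relative paths under root."""
    index = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
            index.setdefault(digest, []).append(os.path.relpath(path, root))
    return index

def find_exact_renames(old_root, new_root):
    """Case 1: files whose bytes are identical but whose path changed."""
    old_index = index_by_content(old_root)
    renames = []
    for digest, new_paths in index_by_content(new_root).items():
        old_paths = old_index.get(digest, [])
        for new_path in new_paths:
            if old_paths and new_path not in old_paths:
                # emit a copy/rename op instead of storing the file again
                renames.append((old_paths[0], new_path))
    return renames
```

Each detected pair could then become a copy/rename operation in the delta instead of a full re-add of the file.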
Nice to have, but definitely low priority given the cost of higher complexity and of introducing a new major format version for compatibility. I doubt Apple's tools handle renames either, which gives another point of reference. [edit] I don't think most VCSes do automatic rename detection either.
Not on topic, but I suspect Vivaldi (or Chromium?) disobeys Apple's guidelines on where files should be placed. So perhaps it isn't that great of a role model anyway.
By the way, git does very poorly with large and binary files, which may be relevant here.
git's in-file diffing may be weaker than bsdiff's, but I've tested rename detection on two versions of Vivaldi specifically, and git found the renames well.
…
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Framework.framework/Vivaldi Framework | Bin 124104928 -> 124104928 bytes
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Framework.framework/_CodeSignature/CodeResources | 304 +++++++--------
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/Info.plist | 4 +-
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/MacOS/Vivaldi Helper | Bin 18624 -> 18624 bytes
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/PkgInfo | 0
…
I've looked at Google's Courgette, but it still seems to be focused on file-to-file comparisons, so I'm guessing Chrome handles the Versions folder renaming explicitly, outside of the diff algorithm.
I was thinking about performance with regard to large and binary files; git is optimized for source code.
I didn't think Courgette even supported macOS, so I'm not sure the implications are the same there.
I thought more Chromium-based apps did this, but I've checked a few, and it seems only Opera, Vivaldi, and Chrome itself have the Versions dir.
I picked up two more recent Vivaldi builds because the originals posted here were dead. Here are some interesting results (cc @kornelski in case you're interested). Overall though, this is a really tricky app to generate a good diff for.
They are shipping:
Vivaldi.5.0.2497.32.universal.dmg: 202,683,957 bytes (202.7 MB, likely bzip2 using level 9 compression)
Vivaldi.5.0.2497.32.app (uncompressed): 536,715,935 bytes (536.7 MB)
This is upgrading from Vivaldi.5.0.2497.28 (536,700,726 bytes; 536.7 MB) which was "two" builds away.
----
Creating a Vivaldi.5.0.2497.32.tar.xz:
Vivaldi.5.0.2497.32.tar.xz: 152,161,084 bytes (152.2 MB; I assume it's level 6 LZMA)
Compress
~> time tar cfJ vivaldi.tar.xz Vivaldi-5.0.2497.32.app/
________________________________________________________
Executed in 161.46 secs fish external
usr time 159.77 secs 74.00 micros 159.77 secs
sys time 1.13 secs 1264.00 micros 1.12 secs
Decompress
~> time tar -xf vivaldi.tar.xz
________________________________________________________
Executed in 8.31 secs fish external
usr time 6.98 secs 76.00 micros 6.98 secs
sys time 0.48 secs 1172.00 micros 0.48 secs
----
~> time ./BinaryDelta create --version=2 --compression bzip2 --compression-level 9 --verbose ./Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v2.delta
198,590,001 bytes patch -- 198.6 MB (bzip2 compression level 9, version 2 delta)
Create
________________________________________________________
Executed in 34.30 secs fish external
usr time 30.35 secs 113.00 micros 30.35 secs
sys time 0.83 secs 1331.00 micros 0.83 secs
Apply
________________________________________________________
Executed in 17.27 secs fish external
usr time 12.59 secs 95.00 micros 12.59 secs
sys time 1.24 secs 1349.00 micros 1.23 secs
----
~> time ./BinaryDelta create --version=3 --compression bzip2 --compression-level 9 --verbose ./Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v3_bzip2.delta
183,961,719 bytes patch -- 184 MB (bzip2 compression level 9, version 3 delta)
Create
________________________________________________________
Executed in 31.75 secs fish external
usr time 27.81 secs 18.70 millis 27.79 secs
sys time 0.63 secs 3.01 millis 0.63 secs
Apply
________________________________________________________
Executed in 17.05 secs fish external
usr time 11.62 secs 82.00 micros 11.62 secs
sys time 1.41 secs 1137.00 micros 1.41 secs
----
~> time ./BinaryDelta create --version=3 --verbose /Users/msp/Downloads/Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v3.delta
139,654,362 bytes patch -- 139.7 MB (LZMA compression level 6, version 3 delta)
217 files were cloned/renamed (mostly json, png, pak at a glance)
Create
________________________________________________________
Executed in 154.62 secs fish external
usr time 149.96 secs 107.00 micros 149.96 secs
sys time 1.21 secs 1766.00 micros 1.21 secs
Apply
________________________________________________________
Executed in 12.04 secs fish external
usr time 6.64 secs 81.00 micros 6.64 secs
sys time 1.44 secs 974.00 micros 1.44 secs
------
If I change the skip-file-clone-threshold from 4096 bytes to 512 bytes, I get a ~0.4 MB reduction, which is not very much. I don't know how much we'd save if we also tracked renames whose content has changed, but that's trickier.
At least we now create deltas that are smaller than vanilla LZMA compression. And considering developers are far more likely to distribute updates as DMG than as .tar.xz, the savings are significant in practice.
Additional Notes:
So the results are not very good (for this specific app) if you compare against a vanilla LZMA tarball. However, they're a big improvement over the version 2 format, a bzip2 delta patch, or a normal DMG or ZIP app archive.
I will close this issue out for now anyway, since we now track basic renames and I'm not sure this is worth pursuing further.
[edit]: I changed my mind. It looks like more advanced file rename detection could result in better patches, but it's hard to tell how much. I could potentially use something like git to tell me all the similar-renames, undo them, and see. I'd need to verify that the savings would be significant after compression before investigating any particular approach. Maybe for some other day.
For rename detection, a rolling hash is the way to go. Scan all files, use the rolling hash to chunk them, and create a map of hash(chunk) => [(file, position), …], and you will know which files have the most chunks in common.
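As a rough illustration of that idea, here's a toy content-defined chunker (nothing like a production rolling hash such as rsync's or casync's; the hash and mask here are deliberately crude):

```python
def chunk_offsets(data, mask=0x1FFF):
    """Content-defined chunk boundaries: declare a boundary wherever the low
    bits of a crude rolling hash are all ones (mask=0x1FFF targets ~8 KiB
    average chunks). Because the hash only depends on recent bytes, equal
    regions in two files tend to produce the same boundaries."""
    boundaries = [0]
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # bytes older than ~32 positions fall out
        if h & mask == mask:
            boundaries.append(i + 1)
    if boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def chunk_map(files):
    """files: {name: bytes}. Build the hash(chunk) -> [(file, offset), ...] table."""
    table = {}
    for name, data in files.items():
        bounds = chunk_offsets(data)
        for start, end in zip(bounds, bounds[1:]):
            table.setdefault(hash(data[start:end]), []).append((name, start))
    return table
```

Files sharing many chunk hashes in this table are the rename/similarity candidates.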
I think I'm more tempted to take advantage of the .framework structure: detect whether the Current framework version changes, or whether a removal/addition happens in that directory, if that is the most common disastrous case. All the file renames that git detects for this app come from the framework version change, and indeed the patch comes down to about 15 MB when I change the version myself. (We've even had to rename the Sparkle.framework version fairly recently due to a compatibility issue.)
Or maybe something more clever: out of all the files that have been renamed without any content changes, run some sort of similarity comparison on their path components to detect a directory rename. Because if there are lots of files that are similar in content under a different relative path, there are likely a lot that are exactly the same too.
[edit]: so what I want to try..
If we do a pass and see an exact clone the way we currently detect it: eg ./Vivaldi-5.0.2497.28.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.28/Resources/product_logo_32.png -> ./Vivaldi-5.0.2497.32.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.32/Resources/product_logo_32.png
.. then we know that the divergence is ./Vivaldi-5.0.2497.32.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.32/ -> ./Vivaldi-5.0.2497.28.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.28/
and so when we encounter a new item that is a descendant of the 5.0.2497.32 directory, we can test whether the remainder of the relative file path also exists under the 5.0.2497.28 directory. Sounds like O(N^2) to me, but path components usually aren't that large.
I could also insert logic for detecting framework version changes here as a subset of this approach.
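A sketch of that prefix inference (hypothetical helper names; the shortened example paths below just mimic a framework Versions bump):

```python
def divergent_prefixes(old_path, new_path):
    """Strip the longest common path suffix from an exact-clone pair; what
    remains is the pair of directory prefixes that diverged."""
    old_parts, new_parts = old_path.split("/"), new_path.split("/")
    while old_parts and new_parts and old_parts[-1] == new_parts[-1]:
        old_parts.pop()
        new_parts.pop()
    return "/".join(old_parts), "/".join(new_parts)

def candidate_old_path(new_item, prefix_map):
    """For a new item under a known renamed directory, rewrite its prefix to
    the old directory, yielding a candidate old path to test for existence."""
    for new_prefix, old_prefix in prefix_map.items():
        if new_item.startswith(new_prefix + "/"):
            return old_prefix + new_item[len(new_prefix):]
    return None
```

If the candidate path exists in the old tree, it becomes the diff base (or a clone source if the contents are identical).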
--
Hm, for the rolling hash approach, I guess that would mean going through each file in the before-tree and building a hash(chunk) => [(file, position), …] table. Then go through each file in the after-tree (each one that would otherwise be treated as an addition and is big enough to care about), compute its [hash(chunk)] list, accumulate a table of files with match counters, and pick the file with the most matching chunks. Then compare matching chunks / total chunks >= some percentage to decide whether a diff should be computed (we don't care so much about detecting a file rename properly as about making a good diff). Sort of sounds like O(N^3) to me (maybe O(N^2) if the chunks are unique enough), and there can be a lot of chunks.
The chunking itself is an entirely linear process. I've got a really crappy implementation of it (sorry for the completely unreadable code, it was just a proof of concept), and it takes only a couple of seconds on Vivaldi.
https://gitlab.com/kornelski/content-chunking-throwaway/-/blob/main/src/chunking.rs#L409
It can make a 78 MB diff in 7 minutes (99% of that time is spent in zstd), or a 127 MB diff in a few seconds.
Last I checked, we should be able to get down to ~15 MB in this Vivaldi case with our bsdiff implementation in ~9 minutes (on my M1 MBA), even using a simple framework version change heuristic.
Anyway, at the least I will make sure our format/applier is able to handle these diff changes (regardless of how they're detected on the creation side).
On the creation side, I could start with a simple heuristic to get it working, then perhaps explore a rolling hash afterwards.
I assumed that for our case the hash chunks would have to be fixed-size, because otherwise it would be hard to get them to match. In your algorithm, I see the chunk size varies based on the file size, for example. I guess maybe we could assume that similar files have similar sizes and are likely to have the same number of chunks if we do something like fileSize / arbitrarySize. The other concern I have is about pulling in another dependency or hand-rolling something (but maybe that's not as bad as I think, and only BinaryDelta needs it).
If I don't find significant enough gains from a simple framework/bundle version change heuristic (using a path-prefix test and lookup, and maybe adding additional prefixes from parent paths of identical files), I'm not sure I'm going to chase this down further, though. I looked through different versions of some other apps and didn't find much renaming. There was some, like an app that bundled python3.9/ instead of python3.7/, but it wasn't that significant, and even that can be easy to detect; we can cheat a little bit here for the common whole-directory-changed cases.
As a side note, in terms of design, having a section of the archive that is just a list of nul-separated file names is great. With any compression on top, it's going to be efficient, and you don't even need to encode them as relative paths. You can have extra paths there for referring to paths only in the old archive.
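For illustration only (not the actual archive format code), packing and unpacking such a section is a couple of lines, and the compressor on top deduplicates the long shared directory prefixes for free:

```python
def encode_paths(paths):
    """Pack a list of UTF-8 paths into one nul-separated blob."""
    return b"\x00".join(p.encode("utf-8") for p in paths)

def decode_paths(blob):
    """Inverse of encode_paths; an empty blob means no paths."""
    return [p.decode("utf-8") for p in blob.split(b"\x00")] if blob else []
```

This works because file paths can't contain nul bytes, so no escaping is needed.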
The chunk size the rolling hash algorithm aims for should ideally be configured the same for all files, but I needed to special-case tiny and huge files, since small chunks are expensive for large files, and chunks larger than the file are obviously not that useful.
I think for renames it's always better to pick some file as a diff base than none. Even if you choose a completely useless file as a base, bsdiff will just ignore it. So even a basic heuristic is fine. It could be as basic as searching for a file with the same name in any directory, or a file with the same extension and a similar size.
> As a side note, in terms of design, having a section of the archive that is just a list of nul-separated file names is great. With any compression on top, it's going to be efficient, and you don't even need to encode them as relative paths. You can have extra paths there for referring to paths only in the old archive.
Yep, exactly what we're doing.
> I think for renames it's always better to pick some file as a diff base than none. Even if you choose a completely useless file as a base, bsdiff will just ignore it. So even a basic heuristic is fine. It could be as basic as searching for file with the same file name in any directory, or a file with same file name extension and a similar size.
This was helpful confirmation. I've just made a change to use a couple of heuristics like this (same file name, similar size, framework version change), and it looks like it works well. I still have a threshold on the amount of data, because it looks to me like in some cases compressing small text files beats patches, even when files are unchanged and no diff is needed. Maybe not on an individual basis (I haven't verified), but when we compress the entire container together.
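Roughly what those heuristics amount to (a sketch; the function name, tolerance, and tie-breaking are made up, not what BinaryDelta actually does):

```python
import os

def pick_base_by_heuristic(new_path, new_size, old_files, size_tolerance=0.25):
    """old_files: {path: size}. First try any old file with the same base name,
    preferring the one closest in size; otherwise fall back to any old file with
    the same extension and a size within size_tolerance of the new file's."""
    name = os.path.basename(new_path)
    same_name = [p for p in old_files if os.path.basename(p) == name]
    if same_name:
        return min(same_name, key=lambda p: abs(old_files[p] - new_size))
    ext = os.path.splitext(new_path)[1]
    similar = [p for p in old_files
               if os.path.splitext(p)[1] == ext
               and abs(old_files[p] - new_size) <= size_tolerance * max(new_size, 1)]
    return min(similar, key=lambda p: abs(old_files[p] - new_size)) if similar else None
```

Since bsdiff ignores a useless base, a wrong guess here only costs creation time, never patch correctness.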
I was toying with zstd a little for the Vivaldi Framework binary file (a >300 MB file accounting for most of the patch size, I believe). With -19 I was getting around a 55 MB patch, but with our BinaryDelta (bsdiff + lzma) I was getting around a 13 MB one (although it took maybe 2.5x longer). zstd suggested some other CLI options for improved compression, but they were taking too long and I lost patience. The other thing is that we want the uncompressed data back for better cross-file compression, and zstd doesn't work that way.
Another unrelated note: the way I'm currently compressing/decompressing data is probably not optimal, e.g. fread block (original file) -> temp buffer -> compression_lib -> temp buffer2 -> fwrite block (compressed file). I could probably short-circuit some of this with memory-mapped I/O, but it might be annoying (and it's also possible this is not much of an overhead compared to the actual compression).
When a file is moved, the binary delta tool sees it as a deletion and a complete re-creation of the file from scratch.
This makes the binary delta ineffective for some Chrome-like apps, because they put the majority of their files in Contents/Versions/vXXX/, so every minor update is a mass rename of almost all files. E.g. between
https://downloads.vivaldi.com/snapshot-auto/Vivaldi.1.7.715.3.tar.xz https://downloads.vivaldi.com/snapshot-auto/Vivaldi.1.7.705.3.tar.xz
the binary delta takes 60 MB (larger than the full 50 MB .xz file!), but with the Versions subdir renamed to match, the delta is only 2.3 MB.