Closed · kornelski closed this issue 2 years ago
There are two cases:

1. A file is renamed or moved, but its content is unchanged.
2. A file is renamed or moved, and its content is modified as well.

I think case 1 can be supported reasonably easily by extending the delta format with a copy/rename operation. The source files can be found by the hash of their contents.

Case 2 is a bit trickier to implement, because we'd have to find the "most similar" candidate for the diff. Perhaps rsync-like rolling hashes could be used? git also handles this well, so we could see if we can copy its approach.
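To make case 1 concrete, here's a minimal Python sketch of finding moved-but-unchanged files by hashing their contents. The function names (`index_by_content`, `find_exact_renames`) are hypothetical, not anything from the actual delta tool:

```python
import hashlib
import os

def index_by_content(root):
    """Map SHA-256 of each file's contents -> list of relative paths under root."""
    index = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
            index.setdefault(digest, []).append(os.path.relpath(path, root))
    return index

def find_exact_renames(old_root, new_root):
    """Case 1: files whose bytes are identical but whose path changed."""
    old_index = index_by_content(old_root)
    renames = []
    for digest, new_paths in index_by_content(new_root).items():
        old_paths = old_index.get(digest, [])
        for new_path in new_paths:
            if old_paths and new_path not in old_paths:
                # emit a copy/rename op instead of storing the file again
                renames.append((old_paths[0], new_path))
    return renames
```

Each detected pair could then become a copy/rename operation in the delta instead of a full re-add of the file.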
Nice to have, but definitely low priority given the cost of higher complexity and of introducing a new major format version for compatibility. I doubt Apple's tools handle renames either, which gives another point of reference. [edit] I don't think most VCSes do automatic rename detection either.
Not on topic, but I suspect Vivaldi (or Chromium?) disobeys Apple's guidelines on where files should be placed. So perhaps it isn't that great of a role model anyway.
By the way, git does very poorly with large and binary files, which may be relevant here.
git's in-file diffing may be weaker than bsdiff's, but I've tested rename detection on two versions of Vivaldi specifically, and git found the renames well.
…
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Framework.framework/Vivaldi Framework | Bin 124104928 -> 124104928 bytes
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Framework.framework/_CodeSignature/CodeResources | 304 +++++++--------
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/Info.plist | 4 +-
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/MacOS/Vivaldi Helper | Bin 18624 -> 18624 bytes
a/Vivaldi.app/Contents/Versions/{1.7.705.3 => 1.7.715.3}/Vivaldi Helper.app/Contents/PkgInfo | 0
…
I've looked at Google's Courgette, but it still seems to be focused on file-to-file comparisons, so I'm guessing Chrome handles the Versions folder renaming explicitly, outside of the diff algorithm.
I was thinking about performance with regard to large and binary files; git is optimized for source code.
I didn't think Courgette even supported macOS, so I'm not sure the implications are the same there.
I thought more Chromium-based apps did this, but I've checked a few, and it seems only Opera, Vivaldi, and Chrome itself have the Versions dir.
I picked up two more recent Vivaldi builds because the originals posted here were dead. Here are some interesting results (cc @kornelski in case you're interested). Overall though, this is a really tricky app to generate a good diff for.
They are shipping:
Vivaldi.5.0.2497.32.universal.dmg: 202,683,957 bytes (202.7 MB, likely bzip2 using level 9 compression)
Vivaldi.5.0.2497.32.app (uncompressed): 536,715,935 bytes (536.7 MB)
This is upgrading from Vivaldi.5.0.2497.28 (536,700,726 bytes; 536.7 MB) which was "two" builds away.
----
Creating a Vivaldi.5.0.2497.32.tar.xz:
Vivaldi.5.0.2497.32.tar.xz: 152,161,084 bytes (152.2 MB; I assume it's level 6 LZMA)
Compress
~> time tar cfJ vivaldi.tar.xz Vivaldi-5.0.2497.32.app/
________________________________________________________
Executed in 161.46 secs fish external
usr time 159.77 secs 74.00 micros 159.77 secs
sys time 1.13 secs 1264.00 micros 1.12 secs
Decompress
~> time tar -xf vivaldi.tar.xz
________________________________________________________
Executed in 8.31 secs fish external
usr time 6.98 secs 76.00 micros 6.98 secs
sys time 0.48 secs 1172.00 micros 0.48 secs
----
~> time ./BinaryDelta create --version=2 --compression bzip2 --compression-level 9 --verbose ./Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v2.delta
198,590,001 bytes patch -- 198.6 MB (bzip2 compression level 9, version 2 delta)
Create
________________________________________________________
Executed in 34.30 secs fish external
usr time 30.35 secs 113.00 micros 30.35 secs
sys time 0.83 secs 1331.00 micros 0.83 secs
Apply
________________________________________________________
Executed in 17.27 secs fish external
usr time 12.59 secs 95.00 micros 12.59 secs
sys time 1.24 secs 1349.00 micros 1.23 secs
----
~> time ./BinaryDelta create --version=3 --compression bzip2 --compression-level 9 --verbose ./Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v3_bzip2.delta
183,961,719 bytes patch -- 184 MB (bzip2 compression level 9, version 3 delta)
Create
________________________________________________________
Executed in 31.75 secs fish external
usr time 27.81 secs 18.70 millis 27.79 secs
sys time 0.63 secs 3.01 millis 0.63 secs
Apply
________________________________________________________
Executed in 17.05 secs fish external
usr time 11.62 secs 82.00 micros 11.62 secs
sys time 1.41 secs 1137.00 micros 1.41 secs
----
~> time ./BinaryDelta create --version=3 --verbose /Users/msp/Downloads/Vivaldi-5.0.2497.28.app ./Vivaldi-5.0.2497.32.app ./Vivaldi_v3.delta
139,654,362 bytes patch -- 139.7 MB (LZMA compression level 6, version 3 delta)
217 files were cloned/renamed (mostly json, png, pak at a glance)
Create
________________________________________________________
Executed in 154.62 secs fish external
usr time 149.96 secs 107.00 micros 149.96 secs
sys time 1.21 secs 1766.00 micros 1.21 secs
Apply
________________________________________________________
Executed in 12.04 secs fish external
usr time 6.64 secs 81.00 micros 6.64 secs
sys time 1.44 secs 974.00 micros 1.44 secs
------
If I change the skip-file-clone-threshold from 4096 bytes to 512 bytes, I get a ~0.4 MB reduction, which is not very much. I don't know how much we'd save if we also tracked renames whose content has changed, but that's trickier.
At least we now create deltas that are smaller than vanilla LZMA compression. And considering developers are far more likely to distribute updates as DMG than as .tar.xz, the savings are significant in practice.
Additional Notes:
So the results are not very good (for this specific app) if you compare against a vanilla LZMA tarball. However, they're a big improvement over the version 2 format, a bzip2 delta patch, or a normal DMG or ZIP app archive.
I will close this issue out for now anyway, since we now track basic renames and I'm not sure this is worth pursuing further.
[edit]: I changed my mind. It looks like more advanced file rename detection could result in better patches, but it's hard to tell how much. I could potentially use something like git to tell me all the similar-renames, undo them, and see. I'd need to verify that the savings would be significant after compression before investigating any particular approach. Maybe for some other day.
For rename detection, a rolling hash is the way to go. Scan all files, use the rolling hash to chunk them, and create a map of hash(chunk) => [(file, position), …], and you will know which files have the most chunks in common.
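As a rough illustration of that idea, here's a toy content-defined chunker (nothing like a production rolling hash such as rsync's or casync's; the hash and mask here are deliberately crude):

```python
def chunk_offsets(data, mask=0x1FFF):
    """Content-defined chunk boundaries: declare a boundary wherever the low
    bits of a crude rolling hash are all ones (mask=0x1FFF targets ~8 KiB
    average chunks). Because the hash only depends on recent bytes, equal
    regions in two files tend to produce the same boundaries."""
    boundaries = [0]
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # bytes older than ~32 positions fall out
        if h & mask == mask:
            boundaries.append(i + 1)
    if boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def chunk_map(files):
    """files: {name: bytes}. Build the hash(chunk) -> [(file, offset), ...] table."""
    table = {}
    for name, data in files.items():
        bounds = chunk_offsets(data)
        for start, end in zip(bounds, bounds[1:]):
            table.setdefault(hash(data[start:end]), []).append((name, start))
    return table
```

Files sharing many chunk hashes in this table are the rename/similarity candidates.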
I think I'm more tempted to take advantage of the .framework structure: detect whether the Current framework version changes, or whether a removal/addition happens in that directory, if that is the most common disastrous case. All the file renames that git detects for this app come from the framework version change, and indeed the patch comes down to about 15 MB when I change the version myself. (We've even had to rename the Sparkle.framework version fairly recently due to a compatibility issue.)
Or maybe something more clever: out of all the files that have been renamed without any content changes, run some sort of similarity comparison on their path components to detect a directory rename. Because if there are lots of files that are similar in content under a different relative path, there are likely a lot that are exactly the same too.
[edit]: so what I want to try..
If we do a pass and see an exact clone the way we currently detect it: eg ./Vivaldi-5.0.2497.28.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.28/Resources/product_logo_32.png -> ./Vivaldi-5.0.2497.32.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.32/Resources/product_logo_32.png
.. then we know that the divergence is ./Vivaldi-5.0.2497.32.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.32/ -> ./Vivaldi-5.0.2497.28.app/Contents/Frameworks/Vivaldi Framework.framework/Versions/5.0.2497.28/
and so when we encounter a new item that is a descendant of the 5.0.2497.32 directory, we can test whether the remainder of the relative file path also exists under the 5.0.2497.28 directory. Sounds like O(N^2) to me, but path components usually aren't that large.
I could also insert logic for detecting framework version changes here as a subset of this approach.
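A sketch of that prefix inference (hypothetical helper names; the shortened example paths below just mimic a framework Versions bump):

```python
def divergent_prefixes(old_path, new_path):
    """Strip the longest common path suffix from an exact-clone pair; what
    remains is the pair of directory prefixes that diverged."""
    old_parts, new_parts = old_path.split("/"), new_path.split("/")
    while old_parts and new_parts and old_parts[-1] == new_parts[-1]:
        old_parts.pop()
        new_parts.pop()
    return "/".join(old_parts), "/".join(new_parts)

def candidate_old_path(new_item, prefix_map):
    """For a new item under a known renamed directory, rewrite its prefix to
    the old directory, yielding a candidate old path to test for existence."""
    for new_prefix, old_prefix in prefix_map.items():
        if new_item.startswith(new_prefix + "/"):
            return old_prefix + new_item[len(new_prefix):]
    return None
```

If the candidate path exists in the old tree, it becomes the diff base (or a clone source if the contents are identical).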
--
Hm, for the rolling hash approach, I guess that would mean going through each file in the before-tree and building a hash(chunk) => [(file, position), …] table. Then go through each file in the after-tree (each one that would otherwise be treated as an addition and is big enough to care about), compute its [hash(chunk)] list, accumulate a table of files with match counters, and pick the file with the most matching chunks. Then compare matching chunks / total chunks >= some percentage to decide whether a diff should be computed (we don't care so much about detecting a file rename properly as about making a good diff). Sort of sounds like O(N^3) to me (maybe O(N^2) if the chunks are unique enough), and there can be a lot of chunks.
The chunking itself is an entirely linear process. I've got a really crappy implementation of it (sorry for the completely unreadable code, it was just a proof of concept), and it takes only a couple of seconds on Vivaldi.
https://gitlab.com/kornelski/content-chunking-throwaway/-/blob/main/src/chunking.rs#L409
It can make a 78 MB diff in 7 minutes (99% of that time is spent in zstd), or a 127 MB diff in a few seconds.
Last I checked, we should be able to get down to ~15 MB in this Vivaldi case with our bsdiff implementation in ~9 minutes (on my M1 MBA), even using a simple framework version change heuristic.
Anyway, at the least I will make sure our format/applier is able to handle these diff changes (regardless of how they're detected on the creation side).
On the creation side, I could start with a simple heuristic to get it working, then perhaps explore a rolling hash afterwards.
I assumed that for our case the hash chunks would have to be fixed-size, because otherwise it would be hard to get them to match. In your algorithm, I see the chunk size varies based on the file size, for example. I guess maybe we could assume that similar files have similar sizes and are likely to have the same number of chunks if we do something like fileSize / arbitrarySize. The other concern I have is about pulling in another dependency or hand-rolling something (but maybe that's not as bad as I think, and only BinaryDelta needs it).
If I don't find significant enough gains from a simple framework/bundle version change heuristic (using a path-prefix test and lookup, and maybe adding additional prefixes from parent paths of identical files), I'm not sure I'm going to chase this down further, though. I looked through different versions of some other apps and didn't find much renaming. There was some, like an app that bundled python3.9/ instead of python3.7/, but it wasn't that significant, and even that can be easy to detect; we can cheat a little bit here for the common whole-directory-changed cases.
As a side note, in terms of design, having a section of the archive that is just a list of nul-separated file names is great. With any compression on top, it's going to be efficient, and you don't even need to encode them as relative paths. You can have extra paths there for referring to paths only in the old archive.
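For illustration only (not the actual archive format code), packing and unpacking such a section is a couple of lines, and the compressor on top deduplicates the long shared directory prefixes for free:

```python
def encode_paths(paths):
    """Pack a list of UTF-8 paths into one nul-separated blob."""
    return b"\x00".join(p.encode("utf-8") for p in paths)

def decode_paths(blob):
    """Inverse of encode_paths; an empty blob means no paths."""
    return [p.decode("utf-8") for p in blob.split(b"\x00")] if blob else []
```

This works because file paths can't contain nul bytes, so no escaping is needed.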
The chunk size the rolling hash algorithm aims for should ideally be configured the same for all files, but I needed to special-case tiny and huge files, since small chunks are expensive for large files, and chunks larger than the file are obviously not that useful.
I think for renames it's always better to pick some file as a diff base than none. Even if you choose a completely useless file as a base, bsdiff will just ignore it. So even a basic heuristic is fine. It could be as basic as searching for a file with the same name in any directory, or a file with the same extension and a similar size.
> As a side note, in terms of design, having a section of the archive that is just a list of nul-separated file names is great. With any compression on top, it's going to be efficient, and you don't even need to encode them as relative paths. You can have extra paths there for referring to paths only in the old archive.
Yep, exactly what we're doing.
> I think for renames it's always better to pick some file as a diff base than none. Even if you choose a completely useless file as a base, bsdiff will just ignore it. So even a basic heuristic is fine. It could be as basic as searching for file with the same file name in any directory, or a file with same file name extension and a similar size.
This was helpful confirmation. I've just made a change to use a couple of heuristics like this (same file name, similar size, framework version change), and it looks like it works well. I still have a threshold on the amount of data, because it looks to me like in some cases compressing small text files beats patches, even when files are unchanged and no diff is needed. Maybe not on an individual basis (I haven't verified), but when we compress the entire container together.
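Roughly what those heuristics amount to (a sketch; the function name, tolerance, and tie-breaking are made up, not what BinaryDelta actually does):

```python
import os

def pick_base_by_heuristic(new_path, new_size, old_files, size_tolerance=0.25):
    """old_files: {path: size}. First try any old file with the same base name,
    preferring the one closest in size; otherwise fall back to any old file with
    the same extension and a size within size_tolerance of the new file's."""
    name = os.path.basename(new_path)
    same_name = [p for p in old_files if os.path.basename(p) == name]
    if same_name:
        return min(same_name, key=lambda p: abs(old_files[p] - new_size))
    ext = os.path.splitext(new_path)[1]
    similar = [p for p in old_files
               if os.path.splitext(p)[1] == ext
               and abs(old_files[p] - new_size) <= size_tolerance * max(new_size, 1)]
    return min(similar, key=lambda p: abs(old_files[p] - new_size)) if similar else None
```

Since bsdiff ignores a useless base, a wrong guess here only costs creation time, never patch correctness.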
I was toying with zstd a little for the Vivaldi Framework binary file (a >300 MB file accounting for most of the patch size, I believe). With -19 I was getting around a 55 MB patch, but with our BinaryDelta (bsdiff + lzma) I was getting around a 13 MB one (although it took maybe 2.5x longer). zstd suggested some other CLI options for improved compression, but they were taking too long and I lost patience. The other thing is that we want the uncompressed data back for better cross-file compression, and zstd doesn't work that way.
Another unrelated note: the way I'm currently compressing/decompressing data is probably not optimal, e.g. fread block (original file) -> temp buffer -> compression_lib -> temp buffer2 -> fwrite block (compressed file). I could probably short-circuit some of this with memory-mapped I/O, but it might be annoying (and it's also possible this is not much of an overhead compared to the actual compression).
When a file is moved, the binary delta tool sees it as a deletion and a complete re-creation of the file from scratch.
This makes the binary delta ineffective for some Chrome-like apps, because they put the majority of their files in Contents/Versions/vXXX/, so every minor update is a mass rename of almost all files. E.g. between
https://downloads.vivaldi.com/snapshot-auto/Vivaldi.1.7.715.3.tar.xz https://downloads.vivaldi.com/snapshot-auto/Vivaldi.1.7.705.3.tar.xz
the binary delta takes 60 MB (larger than the full 50 MB .xz file!), but with the Versions subdir renamed to match, the delta is only 2.3 MB.