jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org

Still not as efficient as deltup which was released over five years ago #192

Closed birdie-github closed 9 years ago

birdie-github commented 9 years ago
$ deltup -mvj linux-3.18.tar linux-3.19.tar linux-3.18.tar-linux-3.19.tar.dtu
$ xz -9 linux-3.18.tar-linux-3.19.tar.dtu
$ ls -l
-rw-r--r-- 1 root root 2876856 Feb  9 08:20 linux-3.18.tar-linux-3.19.tar.dtu.xz

$ xdelta3-3.0.9-x86.exe -0 -s linux-3.18.tar linux-3.19.tar patch.xdelta
$ xz -9 patch.xdelta
$ ls -la
-rw-r--r--  1 root root   5150232 Apr  5 11:27 patch.xdelta.xz

I.e. almost twice as big (about 1.8x) as the file produced by the ages-old deltup ( https://github.com/jjwhitney/Deltup )

I've tried -I 0 and various very big values for -P and -B, but xdelta still produces results that are not on par with a very old diffing application from Gentoo.

jmacd commented 9 years ago

I have several reasons why this is not an apples-to-apples comparison, but first I'd like you to re-run your experiment. Instead of "-0" followed by xz -9, try "-0 -S lzma", which uses three independent lzma streams, one for each section. After that there should be little benefit to the external xz command. To make your numbers more meaningful, please also include the original file sizes and the intermediate file size produced by each tool.

birdie-github commented 9 years ago

I don't understand why what I did is not an apples-to-apples comparison, but whatever.

Well, the source files can be easily fetched from the web, but if you're so busy then OK, here are the results:

xdelta3-3.0.9-x86.exe -0 -S lzma -s linux-3.18.tar linux-3.19.tar patch.xdelta
ls -la
-rw-r--r--  1 root root   5362977 Apr  7 8:04 patch.xdelta

You probably meant to use -9 instead of -0:

xdelta3-3.0.9-x86.exe -9 -S lzma -s linux-3.18.tar linux-3.19.tar patch.xdelta
ls -la
-rw-r--r--  1 root root   3086439 Apr  7 8:06 patch.xdelta

Almost as efficient as deltup + xz -9 but still not quite there (7% less efficient).
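
The comparisons quoted so far can be checked with a few lines of arithmetic. This is a minimal sketch: the byte counts are copied from the `ls` output in this thread, while the labels and the helper name `overhead_percent` are made up for illustration.

```python
# Patch sizes in bytes, copied from the ls listings quoted in this thread.
# The dictionary keys are informal labels, not tool output.
sizes = {
    "deltup + xz -9": 2876856,
    "xdelta3 -0 + xz -9": 5150232,
    "xdelta3 -0 -S lzma": 5362977,
    "xdelta3 -9 -S lzma": 3086439,
}


def overhead_percent(size, baseline):
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100.0


baseline = sizes["deltup + xz -9"]
for name, size in sizes.items():
    extra = overhead_percent(size, baseline)
    print(f"{name:20s} {size:>9d} bytes  {extra:+6.1f}% vs deltup + xz -9")
```

With these numbers, the -0 patch followed by xz -9 comes out roughly 79% larger than the deltup result, and the -9 -S lzma patch roughly 7% larger, which matches the figures given in the comments.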

jmacd commented 9 years ago

I didn't ask you to run the experiment because I'm busy, but because "linux-3.19" is not a very well qualified name for me to fetch myself. Which 3.19 release candidate? Which minor-version? The question was not whether I can reproduce the result, but whether you get the result you're after.

So now I'll answer your question: this is not an apples-to-apples comparison because Xdelta uses a standardized format that is designed to provide good compression without requiring a secondary compressor. The VCDIFF format was designed for low-power processors that do not perform well at bit-shifting operations, so it's designed entirely with byte-codes. It's a reasonably good format, but by trying to achieve good compression on its own, it somewhat defeats the second pass you're making with xz.

It's worth noting that Xdelta-1.x has a similar design to Deltup, and there are certainly cases where Xdelta-1 outperforms Xdelta-3 too, assuming you factor in the secondary compression pass you're making.

Xdelta-3.x is designed for streaming very large files, and there are tradeoffs that have to be made that make it perform slightly less well for relatively small files.

Now that you're down to 7% worse than Deltup using the -9 -S lzma flags, you should also try the -B setting. Set -B to a value larger than the size of linux-3.18.tar and that ought to make an improvement. If the entire file fits into memory, it uses a slightly different algorithm. Even so, many of these tools (Deltup, Xdelta-1.x, open-vcdiff, bsdiff) are designed to work only on files that fit into memory or can be addressed with a 32-bit pointer. Xdelta-3.x has no input size limit, and I am working on version 3.1.x that will lift the limit on -B to support large-memory machines.

Thanks, Josh
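
The -B suggestion can be sketched as a small script that picks a source window strictly larger than the source file and assembles the command line. This is only an illustration under stated assumptions: the binary is assumed to be on PATH as `xdelta3` (the thread uses `xdelta3-3.0.9-x86.exe`), the helper names and the 1 MiB rounding are arbitrary choices, and the largest -B value a given build accepts is not checked here.

```python
import os
import shlex


def source_window_bytes(source_size, granularity=1 << 20):
    """Return a multiple of `granularity` that is strictly larger than source_size."""
    return (source_size // granularity + 1) * granularity


def build_xdelta3_command(source, target, patch, source_size):
    """Assemble (but do not run) an xdelta3 command using -9, -S lzma and -B.

    The executable name "xdelta3" is an assumption; adjust it for your install.
    """
    window = source_window_bytes(source_size)
    return [
        "xdelta3", "-9", "-S", "lzma",
        "-B", str(window),
        "-s", source, target, patch,
    ]


if __name__ == "__main__":
    src = "linux-3.18.tar"
    # Only print the command when the source file is actually present.
    if os.path.exists(src):
        cmd = build_xdelta3_command(
            src, "linux-3.19.tar", "patch.xdelta", os.path.getsize(src)
        )
        print(shlex.join(cmd))
```

The script only prints the command, so it can be inspected before running it by hand. Whether the larger window actually closes the gap to Deltup is not shown anywhere in this thread.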

birdie-github commented 9 years ago

Just for kicks the source files are these: https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.18.tar.xz https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.19.tar.xz

Your explanation is sufficient for me. Thank you.