jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org
1.1k stars 184 forks source link

xdelta3 - another huge delta output, but resulting file can be lzma's to very low compression ratio #193

Open mykohsu opened 9 years ago

mykohsu commented 9 years ago

I see a few issues about large delta output mainly dealing with uncompressible files or special cases.

However, this situation is different. I am using xdelta to make diffs between installer versions. Normally the diff is 10-20 MB for installers in the 100MB range. However, every once in a while the diff will be 100MB. This diff can be lzma's significantly so it's definitely not an issue of compressibility.

The question is, should this be handled within xdelta or via external compression?

capture

Explanation for files in screenshot.

  1. Original installer has .exe extension.
  2. Installers are compressible as indicated by files with 7z extension.
  3. The diff between installers is 22_38.x and is larger than expected.
  4. This diff can be compressed as indicated by 22_38.7z
  5. The diff between compressed version of installer is 22_38.7x and has expected size.
mgrinzPlayer commented 9 years ago

You know that installers are probably already compressed (deflate)?

xdelta works best on unencrypted and uncompressed data by design.

If your installers are using deflate, try with precomp v0.4.3.

Creating patch:

>precomp.exe -intense -cn 22.exe
>precomp.exe -intense -cn 38.exe
>xdelta3 -fves 22.pcf 38.pcf 22_38.pcf.x

Using patch

>precomp.exe -intense -cn 22.exe
>xdelta3 -fvds 22.pcf 22_38.pcf.x 38.pcf
>precomp.exe -r 38.pcf
mykohsu commented 9 years ago

I'll give it a try but I doubt this is the solution. I say this because

  1. The installer is still pretty compressible.
  2. The diff file is even more compressible.
  3. For most installer versions, the diff from xdelta3 is not compressible.
mgrinzPlayer commented 9 years ago
  1. collect many files, about 200MB, EXEs, DLLs, some game files, some text files
  2. compress them into zip, preset: fastest. Data01.zip
  3. now, copy your collection, and change some files, about 5 MB should be enough (remove few files and add few new)
  4. compress new collection into zip, preset: fastest. Data02.zip
  5. do xdelta3 patch from Data01.zip into Data02.zip. Data01-02.fastest.zip.x
  6. Now, repeat step 2 and step 4. This time, preset: store
  7. do xdelta3 patch from Data01.zip into Data02.zip. Data01-02.store.zip.x

I've made similar test now. Data01-02.fastest.zip.x is 12.6MB (this delta can be compressed with 7z into 12.5MB) Data01-02.store.zip.x is 7.41MB (this delta can be compressed with 7z into 3.91MB)

Why we did this test? Because some installers are using some kind of compression. It can be deflate(zip), it can be LZMA (7zip), LZX (cab), etc. And zip files created in step2,4 are still compressible with 7zip (like your installers). Other installers do not use compression.

If your installers are using deflate, you achieve best result by inflating (precomp) them before creating a patch.

mykohsu commented 9 years ago

But your example does not demonstrate that. The diff using zip fastest compression (2,4) is not compressible, while the diff using zip store (no compression 6) is.

By the way, I am not sure which compression this installer is using, but it compresses 136MB -> 56MB via LZMA.

So, what I am saying - rather than xdelta not working incompressible data it is not working as well on compressible data.

By the way, according the precomp the installer probably does not have compression enabled.

100.00% - New size: 141010903 instead of 139779160

Done.
Time: 13 minute(s), 16 second(s)

Recompressed streams: 9/63
PDF streams: 1/1
ZIP streams: 2/11
PNG streams: 4/13
GIF streams: 2/2
zLib streams (intense mode): 0/36

It does help very slightly but this is not the target use case. capture

I will try it on another installer that I know is compressed, but that is not the issue I am raising here. Either I don't understand what is the secondary compression for, or it is not quite working with this data set.

Increasing input window size gives better results, but this is limited to 16MB.

mgrinzPlayer commented 9 years ago

xdelta3 doesn't care if input data is compressed or not, it doesn't care if it is compressible or not. It just works best when input files have many identical (byte exact) windows/frames.

You have bunch of files within installer1 and installer2 contains mostly the same files and few updated/newer.

There can be two outcomes:

"By the way, according the precomp the installer probably does not have compression enabled" precomp doesn't recognize all compression methods.

"Increasing input window size gives better results, but this is limited to 16MB" More significant change is when you use bigger -B, like -B 268435456

"Either I don't understand what is the secondary compression for" If you don't want to use additional programs like rar, 7zip or UHarc then you just use "-9" with "-S fgk" or "-S djw" or "-S lzma". If you compress diff file with another program anyway, use "-S none" (this is assumed as default, so, can stop using -S switch).

Real life example, installers with LZX compression: 1 10-11.msi.xdelta - the diff between installers 10-11.msi.xdelta.7z - diff can be compressed 10-11.7z.xdelta - diff between 7z compressed installers

Installers without compression (method store) 2 10-11.msi.xdelta - the diff between installers 10-11.msi.xdelta.7z - diff can be compressed even more

It is slightly different than your installers. But, you should see the point.

mykohsu commented 9 years ago

Great detailed explanation. But this indicates that secondary compression is not working. Changing block size only has very minimal effect.

256MB block size, 16MB input window, lzma, -9

xdelta3: source 22.exe source size 129 MiB [136246685] blksize 256 MiB window 256 MiB (FIFO)
xdelta3: 0: in 16.0 MiB: out 11.1 MiB: total in 16.0 MiB: out 11.1 MiB: 6.1 sec
xdelta3: 1: in 16.0 MiB: out 12.2 MiB: total in 32.0 MiB: out 23.2 MiB: 4.9 sec
xdelta3: 2: in 16.0 MiB: out 12.3 MiB: total in 48.0 MiB: out 35.5 MiB: 4.6 sec
xdelta3: 3: in 16.0 MiB: out 9.54 MiB: total in 64.0 MiB: out 45.0 MiB: 3.6 sec
xdelta3: 4: in 16.0 MiB: out 4.79 MiB: total in 80.0 MiB: out 49.8 MiB: 2.0 sec
xdelta3: 5: in 16.0 MiB: out 12.1 MiB: total in 96.0 MiB: out 62.0 MiB: 4.4 sec
xdelta3: 6: in 16.0 MiB: out 12.3 MiB: total in 112 MiB: out 74.2 MiB: 4.4 sec
xdelta3: 7: in 16.0 MiB: out 12.6 MiB: total in 128 MiB: out 86.9 MiB: 4.5 sec
xdelta3: 8: in 5.30 MiB: out 2.85 MiB: total in 133 MiB: out 89.7 MiB: 1.2 sec
xdelta3: finished in 39 sec; input 139779160 output 94065881 bytes (67.30%)

Default block size (2MB), 16MB input window, lzma, -9

xdelta3: source 22.exe source size 129 MiB [136246685] blksize 2.00 MiB window 64.0 MiB (FIFO)
xdelta3: 0: in 16.0 MiB: out 11.1 MiB: total in 16.0 MiB: out 11.1 MiB: 4.4 sec
xdelta3: 1: in 16.0 MiB: out 12.2 MiB: total in 32.0 MiB: out 23.2 MiB: 5.0 sec
xdelta3: 2: in 16.0 MiB: out 12.3 MiB: total in 48.0 MiB: out 35.5 MiB: 5.5 sec
xdelta3: 3: in 16.0 MiB: out 9.54 MiB: total in 64.0 MiB: out 45.0 MiB: 4.2 sec
xdelta3: 4: in 16.0 MiB: out 4.79 MiB: total in 80.0 MiB: out 49.8 MiB: 2.2 sec
xdelta3: 5: in 16.0 MiB: out 12.1 MiB: total in 96.0 MiB: out 62.0 MiB: 5.4 sec
xdelta3: 6: in 16.0 MiB: out 12.3 MiB: total in 112 MiB: out 74.2 MiB: 5.3 sec
xdelta3: 7: in 16.0 MiB: out 12.6 MiB: total in 128 MiB: out 86.9 MiB: 5.5 sec
xdelta3: 8: in 5.30 MiB: out 2.87 MiB: total in 133 MiB: out 89.7 MiB: 1.5 sec
xdelta3: finished in 41 sec; input 139779160 output 94099657 bytes (67.32%)

Default block size (2MB), default input window (8 MB), lzma, -9

xdelta3: source 22.exe source size 129 MiB [136246685] blksize 2.00 MiB window 64.0 MiB (FIFO)
xdelta3: 0: in 8.00 MiB: out 4.05 MiB: total in 8.00 MiB: out 4.05 MiB: 1.9 sec
xdelta3: 1: in 8.00 MiB: out 7.02 MiB: total in 16.0 MiB: out 11.1 MiB: 3.0 sec
xdelta3: 2: in 8.00 MiB: out 8.00 MiB: total in 24.0 MiB: out 19.1 MiB: 3.5 sec
xdelta3: 3: in 8.00 MiB: out 7.99 MiB: total in 32.0 MiB: out 27.1 MiB: 3.2 sec
xdelta3: 4: in 8.00 MiB: out 7.99 MiB: total in 40.0 MiB: out 35.1 MiB: 3.4 sec
xdelta3: 5: in 8.00 MiB: out 7.99 MiB: total in 48.0 MiB: out 43.0 MiB: 3.7 sec
xdelta3: 6: in 8.00 MiB: out 7.99 MiB: total in 56.0 MiB: out 51.0 MiB: 3.7 sec
xdelta3: 7: in 8.00 MiB: out 1.56 MiB: total in 64.0 MiB: out 52.6 MiB: 738 ms
xdelta3: 8: in 8.00 MiB: out 28.0 B: total in 72.0 MiB: out 52.6 MiB: 7 ms
xdelta3: 9: in 8.00 MiB: out 4.79 MiB: total in 80.0 MiB: out 57.4 MiB: 2.2 sec
xdelta3: 10: in 8.00 MiB: out 7.99 MiB: total in 88.0 MiB: out 65.4 MiB: 3.4 sec
xdelta3: 11: in 8.00 MiB: out 7.99 MiB: total in 96.0 MiB: out 73.4 MiB: 3.4 sec
xdelta3: 12: in 8.00 MiB: out 7.99 MiB: total in 104 MiB: out 81.4 MiB: 3.3 sec
xdelta3: 13: in 8.00 MiB: out 7.99 MiB: total in 112 MiB: out 89.4 MiB: 3.2 sec
xdelta3: 14: in 8.00 MiB: out 7.99 MiB: total in 120 MiB: out 97.3 MiB: 3.6 sec
xdelta3: 15: in 8.00 MiB: out 4.66 MiB: total in 128 MiB: out 101 MiB: 2.1 sec
xdelta3: 16: in 5.30 MiB: out 2.88 MiB: total in 133 MiB: out 104 MiB: 1.5 sec
xdelta3: finished in 48 sec; input 139779160 output 109964174 bytes (78.67%)

Try djw just in case.

Default block size (2MB), default input window (8 MB), djw, -9

xdelta3-x64.exe -e -9 -S djw -I 0 -v -f -s
xdelta3: finished in 26 sec; input 139779160 output 110231741 bytes (78.86%)
mgrinzPlayer commented 9 years ago

I think xdelta3 "-S lzma" do not use big dictionary. Probably dictionary is 16MB, FastBytes=32,

10-11.msi.xdelta.7z - compressed with 7zip/max - 3.75MB 10-11.msi.xdelta-SecLZMA - secondary compression "-S lzma" - 4.04MB

jmacd commented 9 years ago

I wonder if this has to do with LZMA using one of the exe prefilters for x86 executables? That would explain why LZMA can do a good job on this data, while Xdelta seemingly can't. I haven't looked into using prefilters of that nature with Xdelta, but it might be possible.

On Fri, May 8, 2015 at 6:04 PM, mgrinzPlayer notifications@github.com wrote:

I think xdelta3's "-S lzma" do not use big dictionary. Probably dictionary is 16MB, FastBytes=32,

10-11.msi.xdelta.7z - 3.75MB 10-11.msi.xdelta-SecLZMA - 4.04MB

— Reply to this email directly or view it on GitHub https://github.com/jmacd/xdelta/issues/193#issuecomment-100404505.

mgrinzPlayer commented 9 years ago

djw has those: "Thursday, November 08, 2007, -S djw exposes secondary compression levels as -S djw1 .. -S djw9"

What about LZMA compression levels? Is it max by default? Or like this: "-0 -S lzma" - xdelta compression level 0 and secondary lzma has set compression level to 0 ... "-7 -S lzma" - xdelta compression level 7 and secondary lzma has set compression level to 7 ... "-9 -S lzma" - xdelta compression level 9 and secondary lzma has set compression level to 9

jmacd commented 9 years ago

I'm not convinced this is a matter of setting the LZMA level, which I'm happy to do. I believe this is because 7z does special things when it encounters executable code... Things that are not done by liblzma.

djw has those http://xdelta.org/2007/11/xdelta-3.html: "Thursday, November 08, 2007, -S djw exposes secondary compression levels as -S djw1 .. -S djw9"

What about LZMA compression levels? Is it max by default? Or like this: "-0 -S lzma" - xdelta compression level 0 and secondary lzma has set compression level to 0 ... "-7 -S lzma" - xdelta compression level 7 and secondary lzma has set compression level to 7 ... "-9 -S lzma" - xdelta compression level 9 and secondary lzma has set compression level to 9

— Reply to this email directly or view it on GitHub https://github.com/jmacd/xdelta/issues/193#issuecomment-101051715.

mgrinzPlayer commented 9 years ago

I opened delta file (-S lzma) with 7zip, info window: info from 7zip

Found this:

xd3_lzma_init (xd3_stream *stream, xd3_lzma_stream *sec, int is_encode)
{
...
      int preset = (stream->flags & XD3_COMPLEVEL_MASK) >> XD3_COMPLEVEL_SHIFT;
      (...)lzma_lzma_preset(&sec->options, preset))
...
}

So, it uses presets or not? I don't see -0 -9 "presets" here:

>lzma
LZMA 9.18 beta : Igor Pavlov : Public domain : 2010-11-02

Usage:  LZMA <e|d> inputFile outputFile [<switches>...]
  e: encode file
  d: decode file
  b: Benchmark
<Switches>
  -a{N}:  set compression mode - [0, 1], default: 1 (max)
  -d{N}:  set dictionary size - [12, 30], default: 23 (8MB)
  -fb{N}: set number of fast bytes - [5, 273], default: 128
  -mc{N}: set number of cycles for match finder
  -lc{N}: set number of literal context bits - [0, 8], default: 3
  -lp{N}: set number of literal pos bits - [0, 4], default: 0
  -pb{N}: set number of pos bits - [0, 4], default: 2
  -mf{MF_ID}: set Match Finder: [bt2, bt3, bt4, hc4], default: bt4
  -mt{N}: set number of CPU threads
  -eos:   write End Of Stream marker
  -si:    read data from stdin
  -so:    write data to stdout

Also: default dictionary size: 8MB default number of fast bytes: 128

mykohsu commented 9 years ago

I tried opening the -S lzma file in 7-zip 9.30 alpha, but it does not open.

However, some testing showed that the compression only works with dictionary size 16MB or larger. Maybe the dictionary size should be set to some larger factor of the delta or inputs.

jmacd commented 9 years ago

FYI

Xdelta is not simply passing its output through LZMA when -S lzma is set. The VCDIFF file format consists of three sections per window: Instructions, Addresses, and Text. Each of these sections is compressed independently using separate LZMA streams, since the data have distributions in each stream. I'm open to increasing the memory used by LZMA.

But, I'm concerned that what's actually happening is that when lzma sees the original data, it detects that an executable file is being compressed and it runs a prefilter to make the executable data more compressible. Xdelta does not do that, and further it makes executable data unrecognizable as such to LZMA and therefore the data output by Xdelta cannot be prefiltered thus compresses less well.

So I'm waiting to be convinced that this is only a problem with LZMA settings and not related to the use of executable prefilters.

On Mon, May 18, 2015 at 11:40 AM, mykohsu notifications@github.com wrote:

I tried opening the -S lzma file in 7-zip 9.30 alpha, but it does not open.

However, some testing showed that the compression only works with dictionary size 16MB or larger. Maybe the dictionary size should be set to some larger factor of the delta or inputs.

— Reply to this email directly or view it on GitHub https://github.com/jmacd/xdelta/issues/193#issuecomment-103165927.

mykohsu commented 9 years ago

Well, LZMA can compress the data output by xdelta, so there is definitely some setting that can be used to achieve similar results.

But this is not the common case, it only happens when the delta is a significant portion of the source.

jmacd commented 9 years ago

OK, give me your favorite pair of files to use as an example and I'll investigate.

On Mon, May 18, 2015 at 11:53 AM, mykohsu notifications@github.com wrote:

Well, LZMA can compress the data output by xdelta, so there is definitely some setting that can be used to achieve similar results.

But this is not the common case, it only happens when the delta is a significant portion of the source.

— Reply to this email directly or view it on GitHub https://github.com/jmacd/xdelta/issues/193#issuecomment-103173720.

mykohsu commented 9 years ago

I'm unable to share my installers so will look around for some other pairs and try to reproduce it.

mgrinzPlayer commented 9 years ago

"tried opening the -S lzma file in 7-zip 9.30 alpha, but it does not open"

You can with 7zip 9.38 beta. Then you can check compression method. Which LZMA stream? I don't know.

PS: recently Igor released "7-Zip 15.02 alpha". I didn't test it.

mykohsu commented 9 years ago

Thanks.

What I can confirm is that -S lzma creates files using LZMA2:18, or 256KB dictionary. Using 256KB dictionary myself on the output file results in no compression. However, using 16MB dictionary, or LZMA2:24, results in compression to 31% ratio.