jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org

Inconsistent results when changing the -B parameter #261

Open abolibibelot1980 opened 3 years ago

abolibibelot1980 commented 3 years ago

I found out about xdelta recently and ran some tests with it, on video files in particular, in order to keep a reference to the original file when it has been remuxed, whether by adding subtitles or an audio track, or simply by remuxing into another container. For files with a straightforward structure, a single header followed by the video / audio streams, it works very well: with the default settings, the generated “diff” file is as small as can be.

But I'm having trouble with TS (Transport Stream) files, which have a specific structure: as I understand it, multiple chunks of video / audio data, each with its own header (which is meant to prevent playback issues when parts of the streams aren't transmitted properly). When such a file is remuxed to MP4 or MKV, the video / audio streams are extracted from those small chunks and placed contiguously in the new container; the remuxed file is therefore significantly smaller because of the reduced “overhead”, and playback is generally smoother (no lag when seeking to an arbitrary timecode).

I have many TS files from television recordings which I have converted or want to convert to MP4 or MKV, but I'd like to keep a reference to the original TS file before deleting it, as I know from experience that unexpected issues can come up later (for instance, a video editor could refuse the remuxed file, or a glitch could have caused a video / audio desynchronization in the remuxed file that cannot be fixed without going back to the original TS file). This is where xdelta comes in (it was suggested here for such purposes).

I first used an old version, 3.0t, which I happened to find bundled with a “patch” for an animated movie in MKV. Below I note the beginning of each file name (the broadcast date) and the size of the “diff” file obtained with xdelta. Using the basic command xdelta3 -e -s "remuxed file" "original file" "diff file", I got these results:

201410150005 ... .mp4 [remuxed file] 370,479,951 bytes
201410150005 ... .ts [original file] 393,272,500 bytes
201410150005 ... .ts.diff 22,100,120 bytes

201410142230 ... .mp4 [remuxed file] 492,882,417 bytes
201410142230 ... .ts [original file] 523,175,988 bytes
201410142230 ... .ts.diff 29,532,754 bytes

201410122320 ... .mp4 [remuxed file] 649,361,203 bytes
201410122320 ... .ts [original file] 689,051,020 bytes
201410122320 ... .ts.diff 162,539,264 bytes
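
To compare these results across files of different sizes, each “diff” size can be expressed as a fraction of the original TS file. A short Python sketch using only the figures listed above (the name `diff_ratio` is just for illustration); it comes out at roughly 5.6% for the first two files and 23.6% for the third.

```python
# (original .ts size, diff size) in bytes, copied from the results above
results = {
    "201410150005": (393_272_500, 22_100_120),
    "201410142230": (523_175_988, 29_532_754),
    "201410122320": (689_051_020, 162_539_264),
}


def diff_ratio(original_size, diff_size):
    """Size of the diff as a fraction of the original file."""
    return diff_size / original_size


for name, (original, diff) in results.items():
    print(f"{name}: {diff_ratio(original, diff):.1%} of the original")
```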

For the first two, the size of the “diff” file seemed acceptable, but for the third one it was obviously too big; so, based on the advice given in this discussion, I used the -B option, setting it to the size of the source file. With -B 649361203 I got:

201410122320 ... .ts.diff [-B 649361203] 38,242,658 bytes

Much better indeed. Then I processed the first two files likewise, setting the -B parameter to the size of the corresponding TS file, figuring that it would shrink the size of the resulting “diff” file some more.

201410150005 ... .ts.diff [-B 393272500] 34,096,675 bytes

201410142230 ... .ts.diff [-B 523175988] 48,801,030 bytes

But the opposite happened: the “diff” files obtained for those two files were significantly bigger! I then tested one of those files with decreasing values of the -B parameter (decreasing powers of 2):

201410142230 ... .ts.diff [xdelta 3.0t -B 536870912] 48,801,030 bytes => same as with the file's size (makes sense as that value is bigger)
201410142230 ... .ts.diff [xdelta 3.0t -B 268435456] 29,687,747 bytes
201410142230 ... .ts.diff [xdelta 3.0t -B 134217728] 29,786,624 bytes
201410142230 ... .ts.diff [xdelta 3.0t -B 67108864] 29,532,754 bytes => default value of “-B”
201410142230 ... .ts.diff [xdelta 3.0t -B 33554432] 257,993,708 bytes

It turns out that for this file, the “sweet spot” is around the default value. Using a very large value (i.e. equal to the source file's size) yields a bigger “diff” file, and using a smaller value yields a much bigger “diff” file.
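
The same conclusion can be read off the measurements mechanically. A minimal Python sketch that uses only the 3.0t figures listed above (the variable names are illustrative); it checks that the tested values are consecutive powers of two and picks the -B value with the smallest “diff”:

```python
# -B value -> diff size in bytes, for 201410142230 with xdelta 3.0t
sizes_by_window = {
    536_870_912: 48_801_030,
    268_435_456: 29_687_747,
    134_217_728: 29_786_624,
    67_108_864: 29_532_754,   # default value of -B
    33_554_432: 257_993_708,
}

# The tested values are consecutive powers of two, 2**25 through 2**29.
assert sorted(sizes_by_window) == [2**k for k in range(25, 30)]

# Pick the window whose diff is smallest.
best_window = min(sizes_by_window, key=sizes_by_window.get)
print(best_window, sizes_by_window[best_window])
```

Note that the sizes are not monotonic in -B: 134217728 gives a slightly bigger “diff” than 268435456, so halving or doubling the value does not move the result in a predictable direction.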

I ran those tests again with xdelta 3.0.11, which seems to be the most recent stable release. Overall performance was much improved (the “diff” files are consistently smaller), yet I saw the same pattern in the relative sizes of the “diff” files obtained with various values of the -B parameter.

201410150005 ... .ts.diff [xdelta 3.0.11 default] 7,533,148 bytes
201410150005 ... .ts.diff [xdelta 3.0.11 -B 393272500] 11,925,966 bytes
201410142230 ... .ts.diff [xdelta 3.0.11 default] 10,116,897 bytes
201410142230 ... .ts.diff [xdelta 3.0.11 -B 523175988] 15,497,312 bytes

Bigger “diff” file with -B set to the size of the source file for those two.

201410122320 ... .ts.diff [xdelta 3.0.11 default] 156,164,290 bytes
201410122320 ... .ts.diff [xdelta 3.0.11 -B 649361203] 30,213,812 bytes

Much smaller “diff” file with -B set to the size of the source file for that one.

201410142230 ... .ts.diff [xdelta 3.0.11 -B 536870912] 15,497,312 bytes => same as with the source file's size
201410142230 ... .ts.diff [xdelta 3.0.11 -B 268435456] 10,306,754 bytes
201410142230 ... .ts.diff [xdelta 3.0.11 -B 134217728] 10,375,873 bytes
201410142230 ... .ts.diff [xdelta 3.0.11 -B 67108864] 10,116,897 bytes => default value of “-B”
201410142230 ... .ts.diff [xdelta 3.0.11 -B 33554432] 250,575,115 bytes

Smaller sizes than with the older version, but the same pattern: the smallest size is obtained with the default value, setting the -B parameter to the size of the source file yields a bigger “diff” file, and setting it to a lower value yields a much bigger “diff” file.

How can those results be explained? And how could I batch process a whole directory with hundreds of TS files if there is seemingly no way of finding an optimal setting for all of them?
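
Absent a single setting that works for every pair, one brute-force workaround is to encode each pair with several -B candidates and keep the smallest result. The Python sketch below is only an illustration of that idea, not a recommendation from the xdelta developers: it assumes `xdelta3` is on the PATH, the helper names `run_xdelta3` and `smallest_delta` are hypothetical, and it costs one full encode per candidate.

```python
import os
import shutil
import subprocess
import tempfile


def run_xdelta3(source, target, delta, source_window=None):
    """Encode target against source and return the delta size in bytes."""
    cmd = ["xdelta3", "-e"]
    if source_window is not None:
        # -B sets the source window size in bytes; None keeps the default.
        cmd += ["-B", str(source_window)]
    cmd += ["-s", source, target, delta]
    subprocess.run(cmd, check=True)  # raises on a non-zero exit status
    return os.path.getsize(delta)


def smallest_delta(source, target, delta, windows, encode=run_xdelta3):
    """Try every -B candidate, keep the smallest delta, return (window, size)."""
    if not windows:
        raise ValueError("at least one candidate is required")
    best = None  # (size, window, path)
    with tempfile.TemporaryDirectory() as tmp:
        for i, window in enumerate(windows):
            candidate = os.path.join(tmp, f"candidate{i}.diff")
            size = encode(source, target, candidate, window)
            if best is None or size < best[0]:
                best = (size, window, candidate)
        # Move the winner out before the temporary directory is removed.
        shutil.move(best[2], delta)
    return best[1], best[0]
```

For the two settings compared in this report, the candidates would be `[None, os.path.getsize(source)]`, i.e. the default window and the size of the source file.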

I could upload the video files used for those tests if necessary, but they're quite big and my upload speed is rather slow, so I would prefer to get some feedback first.

Thanks.

Gabriel, France

ivan386 commented 3 years ago

You can use the -v2 parameter to see the size of each block, and experiment with the -W parameter to get the best results.

-v         be verbose (max 2)
-B bytes   source window size
-W bytes   input window size

The max value of the -W parameter is 67108864 (64 MB).

abolibibelot1980 commented 3 years ago

Thanks for the input, but how do the -B and -W parameters interact in practice, and how should they logically be set relative to each other to approach the best possible compression efficiency? How is the block size derived from those parameters? How can it be explained that the highest possible value for -B doesn't yield the smallest possible “diff” file? And how can I choose one set of -B and -W values to batch process an entire folder of files of different sizes if, as I experienced, the behaviour is markedly different from one pair of files to another? Should files be grouped by size, with a specific set of parameters for each size group, or does the outcome of the “diff” computation depend on the specific distribution of matching / non-matching areas in each pair of files rather than on their sheer size?

I already tested different values of -W with the same value for -B (the default), and it barely affected the result, much less than changing -B: for the 523175988-byte TS file mentioned above, with decreasing -W values between 16777216 and 16384, I got “diff” files between 29496958 and 31974330 bytes, the biggest being obtained with -W 16384 (the lowest possible value) and the smallest with -W 2097152. That test was made with the old 3.0t version; I haven't tried it with the current version since it didn't seem to have a significant effect.

Apparently the max value for -W is 16777216, not 67108864, and the default value is 8388608; 67108864 is the default value for -B, from what I could see with xdelta 3.0.11.
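
For anyone repeating the -W test, the range described above can be enumerated as follows. This is a hedged Python sketch: the bounds are the ones reported in this comment, and stepping through powers of two is an assumption made for the example (the -W values actually tested are not all listed above); `w_candidates` is a hypothetical helper name.

```python
W_MIN = 16_384        # lowest -W value, as reported above
W_MAX = 16_777_216    # highest -W value, as reported above for 3.0.11
W_DEFAULT = 8_388_608  # default -W value, as reported above


def w_candidates(lo=W_MIN, hi=W_MAX):
    """Return the values lo, 2*lo, 4*lo, ... that do not exceed hi."""
    values = []
    value = lo
    while value <= hi:
        values.append(value)
        value *= 2
    return values


for w in w_candidates():
    print("-W", w)
```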

abolibibelot1980 commented 2 years ago

(8 months later) No one else? Is the developer still around to provide some further insight regarding my questions from January?