jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org
1.1k stars 184 forks source link

Source buffer size #196

Open hurda opened 9 years ago

hurda commented 9 years ago

Using a commandline xdelta3-3.0.8.x86-64.exe -e -I 0 -B 268435456 -P 16777216 -W 16777216 for comparing two 2GB files, xdelta is allocating memory the size of 2*(source buffer size) (-B), but only finding differences when they're less or equal than (source buffer size)/2 apart. For my commandline this means it's allocating a total of 512MB of memory, and finds differences (and similarities) in a 128MB-range.

To ensure the source input is read sequentially, with no backward seeks, the encoder maintains the source horizon at half the source buffer size ahead of the input position. A source copy will not be found if it lies more than half the source buffer size away from its absolute position in the input stream.

http://web.archive.org/web/20140709015104/http://code.google.com/p/xdelta/wiki/TuningMemoryBudget

So my question is: What's happening with the other half? What is it used for?

hurda commented 9 years ago

Another problem: When using the same commandline e.g. for comparing two 1MB-files, xdelta is still allocating 512MB.

jmacd commented 9 years ago

For your first question, the other memory goes to a very large hash table data structure. Your second question -- sounds unusual but you are still passing the -B flag with 256MB right?

There are missed opportunities to rduve memory usage in case a like this, but they complicate the code quite a bit. The code is organized to work with streams, including compressed ones, which makes these memory optimizations really tricky since you don't know the size of a stream up front. I removed a few memory-reducing special cases a few releases ago... On Jun 6, 2015 4:10 AM, "hurda" notifications@github.com wrote:

Using a commandline xdelta3-3.0.8.x86-64.exe -e -I 0 -B 268435456 -P 16777216 -W 16777216 for comparing two 2GB files, xdelta is allocating memory the size of 2*(source buffer size) (-B), but only finding differences when they're less or equal than (source buffer size)/2 apart. For my commandline this means it's allocating a total of 512MB of memory, and finds differences (and similarities) in a 128MB-range.

To ensure the source input is read sequentially, with no backward seeks, the encoder maintains the source horizon at half the source buffer size ahead of the input position. A source copy will not be found if it lies more than half the source buffer size away from its absolute position in the input stream.

http://web.archive.org/web/20140709015104/http://code.google.com/p/xdelta/wiki/TuningMemoryBudget

So my question is: What's happening with the other half? What is it used for?

— Reply to this email directly or view it on GitHub https://github.com/jmacd/xdelta/issues/196.