samuelcolvin / xdelta3-python

Fast delta encoding in python using xdelta3

xdelta3.XDeltaError: Output of decoding delta longer than expected #2

Open karamanolev opened 6 years ago

karamanolev commented 6 years ago

I'm trying to use your library. Here's a simple example of how I'm trying it:

    import xdelta3

    b1 = open('b1.bin', 'rb').read()
    b2 = open('b2.bin', 'rb').read()

    d = xdelta3.encode(b1, b2)
    b3 = xdelta3.decode(b1, d)
    print(b2 == b3)

This gives me an exception: xdelta3.XDeltaError: Output of decoding delta longer than expected

I'm attaching a zip file with b1.bin and b2.bin. By the way, they're simple HTML files; I just made sure to treat them as binary so I don't trip over encodings. Also, when trying it in reverse (b2 -> b1 instead of b1 -> b2), it all works well.

    $ python --version
    Python 3.6.1

Running on macOS, Python installed via pyenv, with xdelta3==0.0.5.

files.zip

samuelcolvin commented 6 years ago

I'm getting the same error.

look here, where the comment reads:

    output shouldn't be bigger than the original plus the delta, but give a little leeway

Basically I assumed that the size of the output (b3 here) won't be bigger than len(b1) + len(d) plus some breathing space.

I guess that assumption was wrong. Perhaps b1 has lots of repeated content which xdelta3 is cleverly encoding. In short, I think xdelta3 is performing better than I assumed it could.

I'm no expert C developer, but I would guess the solution is to just increase output_alloc to maybe input_size + source_size * 2.

I'll create a PR for this. Do you mind if I use the two files in files.zip for unit tests?

samuelcolvin commented 6 years ago

see #3 @karamanolev, would be great if you could try that branch and check it's working for you.

If ok, I'll merge and deploy an update.

karamanolev commented 6 years ago

@samuelcolvin Feel free to use the files. Please bear in mind that they were retrieved from a website I don't own, but I don't think that's ever going to be an issue.

As for the output size, it can be very big compared to inputs. Take a look:

    import xdelta3

    b1 = b''
    b2 = b't' * 10000
    len(xdelta3.encode(b1, b2))  # This is 17

This means the inputs to decoding can be 0 bytes and 17 bytes respectively, yet generate an output of 10,000 bytes. If you have 0 bytes and 1 gigabyte of repeated characters, the delta is 1923 bytes. There isn't a formula in the world that allows you to statically allocate a buffer you know is going to be enough.

samuelcolvin commented 6 years ago

Ok, but in real-world use #3 is better than the current value.

Unless you want to submit a PR to dynamically allocate the output buffer size?

karamanolev commented 6 years ago

I'll take a look. The right way to do it, I think, is to implement the streaming capabilities of xdelta3. A compromise with amortized linear complexity would be to start with a reasonably sized buffer (e.g. the old size) and then double it each time you end up with ENOSPC. That shouldn't be too hard to implement.
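A minimal pure-Python sketch of the doubling strategy described above (all names here are illustrative stand-ins; the real fix would live in the C extension's output_alloc handling):

```python
class OutOfSpace(Exception):
    """Stand-in for xdelta3's ENOSPC condition: output buffer too small."""

def decode_into(buf_size, required):
    # Stand-in for the C-level decode call; a real decode would write the
    # output and fail with ENOSPC when buf_size < required.
    if buf_size < required:
        raise OutOfSpace
    return required

def decode_with_doubling(initial, required):
    """Retry the decode, doubling the buffer on each out-of-space error.

    The total work is amortized linear: at most roughly 2x that of a
    correctly sized buffer from the start.
    """
    size = max(initial, 1)
    while True:
        try:
            return decode_into(size, required)
        except OutOfSpace:
            size *= 2
```

For example, starting from a 1024-byte guess and a 10,000-byte output, the loop retries at 2048, 4096, 8192, then succeeds at 16384.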

samuelcolvin commented 6 years ago

Agreed, doubling would be easier. The problem is that it will be slower, since the decode has to retry until a suitable size is reached; what's worse, this performance degradation will be silent.

I can see a situation along the lines of "Here's this situation where xdelta3 should be brilliant and save me loads of data, but it's being a bit slow. Why is that?". But then I don't see a better solution.

Perhaps it would be worth asking on the main xdelta3 repo for advice.

karamanolev commented 6 years ago

It shouldn't be slower by more than a factor of 2 (amortized linear). And it's certainly better than throwing errors. As for the right way: the streaming interface plus allocating buffer blocks as you go. It's documented here: https://github.com/jmacd/xdelta/blob/wiki/ProgrammingGuide.md (see xd3_stream / xd3_config). The implementation would be a bit trickier, though.

samuelcolvin commented 6 years ago

ok, makes sense. Could you create a PR for doubling buffer size?

samuelcolvin commented 6 years ago

ok, I've updated #3 to double output_alloc until the buffer is big enough.

@karamanolev does that look ok to you?

karamanolev commented 6 years ago

@samuelcolvin Your PR looks OK to me. I've opened another, alternative one, which also happens to fix my other big issue with the library as it stands: the NoDelta exception is just surprising, in a bad way.

I put the argument in the PR, but here's some more: if you get the NoDelta exception, you have to add application logic to handle it anyway (you can't just ignore it or something like that). And if you're going to add app logic for that, you can just as well check the delta length yourself. Now imagine you don't want to add logic and just want to store the delta, even if it's a little larger; that's fine. It will likely be 0.001% of your cases and probably won't cause a big storage issue. In that case you're left out to dry: you can't use the library, because it just throws an exception, with no return value at all. This is what happened in my case.
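The application logic in question might look like the sketch below; the exception class and encode function are stand-ins (check the library for the actual exception name), used only to show the store-raw fallback:

```python
class NoDeltaFound(Exception):
    """Stand-in for the library's 'no useful delta' exception."""

def encode_stub(source, target):
    # Stand-in for xdelta3.encode: pretend any delta would be larger
    # than the target itself, so the library raises instead of returning.
    raise NoDeltaFound

def store(source, target):
    """Store a delta when one helps; otherwise store the target verbatim."""
    try:
        return 'delta', encode_stub(source, target)
    except NoDeltaFound:
        return 'raw', target
```

Every caller that wants "just give me something I can store" ends up writing this wrapper, which is the argument for returning a value rather than raising.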

mwojnars commented 4 years ago

Alternative solution: upon encoding, prepend the size of the original string to the delta; that's a single integer. Upon decoding, read this value first, then allocate a buffer sized exactly as needed. You won't need reallocation, nor a compromise on speed.
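A sketch of that framing in Python, using an 8-byte little-endian length prefix (the header format is an illustrative choice, not part of the library):

```python
import struct

SIZE_HEADER = struct.Struct('<Q')  # 8-byte little-endian unsigned length

def wrap(delta: bytes, original_size: int) -> bytes:
    # Prepend the decoded size so the decoder can allocate exactly once.
    return SIZE_HEADER.pack(original_size) + delta

def unwrap(payload: bytes) -> tuple[int, bytes]:
    # Read the size back, then hand the raw delta to the real decoder.
    (original_size,) = SIZE_HEADER.unpack_from(payload)
    return original_size, payload[SIZE_HEADER.size:]
```

The cost is 8 extra bytes per delta and a format that plain xdelta3 tools no longer read directly, which is presumably why falling back to the current behaviour for unprefixed deltas matters.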

samuelcolvin commented 4 years ago

Makes lots of sense, then default to the current value.

PR welcome.