samuelcolvin / xdelta3-python

Fast delta encoding in python using xdelta3

Does the encode method have the distributive property? #5

Closed i30817 closed 6 years ago

i30817 commented 6 years ago

I'm trying to xdelta the contents of zip files without extracting them to disk. So I need a loop that dumps bytes from the zip extract into memory, delta-diffs those bytes, writes the result to a file, truncates the memory buffer used, and repeats. It's fairly easy in Python to write a generator that yields all the bytes from two lists of files using yield and zip_longest, so that's not a problem.

My uncertainty is whether this will produce an xdelta file equivalent to one made from the complete file. The xdelta algorithm doesn't have to 'seek' in its input files, right? And is your published API (a single 'encode' method for writing the diff) enough for this use case: feeding partial contents of files to the encode method and writing that output to disk, with a result no different from having the complete byte array and doing a single encode?

ie:

if s == s1 ++ s2, then encode(s) == encode(s1) ++ encode(s2)

If not, is the above possible if the compression method is 0?
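
As an editorial illustration of the question above (not xdelta3 itself): the sketch below uses zlib with a preset dictionary as a stand-in delta codec, purely so that it runs with only the standard library. The names toy_encode and toy_decode are hypothetical. It suggests why concatenating per-chunk deltas would generally not equal the delta of the whole input: each delta is a self-contained stream with its own header and checksum, so the pieces have to be kept apart and decoded one by one.

```python
import zlib


def toy_encode(source: bytes, target: bytes) -> bytes:
    """Stand-in delta encoder (NOT xdelta3): deflate target with source as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=source) if source else zlib.compressobj(level=9)
    return comp.compress(target) + comp.flush()


def toy_decode(source: bytes, delta: bytes) -> bytes:
    """Inverse of toy_encode: inflate delta with the same preset dictionary."""
    decomp = zlib.decompressobj(zdict=source) if source else zlib.decompressobj()
    return decomp.decompress(delta) + decomp.flush()


s1, s2 = b"the quick brown fox ", b"jumps over the lazy dog"
t1, t2 = b"the quick red fox ", b"jumps over the sleepy dog"

# Delta of the whole input versus concatenated per-chunk deltas.
whole = toy_encode(s1 + s2, t1 + t2)
pieces = toy_encode(s1, t1) + toy_encode(s2, t2)
print("byte-identical:", whole == pieces)

# Each per-chunk delta still round-trips against its own source chunk.
rebuilt = toy_decode(s1, toy_encode(s1, t1)) + toy_decode(s2, toy_encode(s2, t2))
print("round-trips:", rebuilt == t1 + t2)
```

With this stand-in, the per-chunk approach only works if the decoder knows where each delta ends and which source chunk it belongs to. Whether the same holds for xdelta3 output with compression disabled is not confirmed anywhere in this thread.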

samuelcolvin commented 6 years ago

No idea what your ++ means, it doesn't mean anything in python.

Currently xdelta3-python only operates on strings in memory. Unless your files are very large (too large to fit into memory two at a time), it should be entirely possible to create a diff of each pair of files. However, I'm not at all clear on your use case. Perhaps a tiny example here would help?
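
A minimal sketch of the "diff each pair of files in memory" idea, assuming both archives fit in memory. The helpers make_zip and diff_archives are hypothetical names, and toy_encode / toy_decode are a zlib-based stand-in for a real delta codec (not xdelta3) so the example runs with only the standard library.

```python
import io
import zipfile
import zlib


def toy_encode(source: bytes, target: bytes) -> bytes:
    """Stand-in delta encoder (NOT xdelta3): deflate target with source as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=source) if source else zlib.compressobj(level=9)
    return comp.compress(target) + comp.flush()


def toy_decode(source: bytes, delta: bytes) -> bytes:
    """Inverse of toy_encode."""
    decomp = zlib.decompressobj(zdict=source) if source else zlib.decompressobj()
    return decomp.decompress(delta) + decomp.flush()


def make_zip(members: dict) -> bytes:
    """Build a zip archive in memory from {name: data}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def diff_archives(old_zip: bytes, new_zip: bytes) -> dict:
    """Return {member name: delta} for members present in both archives."""
    deltas = {}
    with zipfile.ZipFile(io.BytesIO(old_zip)) as old, \
            zipfile.ZipFile(io.BytesIO(new_zip)) as new:
        old_names = set(old.namelist())
        for name in new.namelist():
            if name in old_names:
                # Each member is read fully into memory, never written to disk.
                deltas[name] = toy_encode(old.read(name), new.read(name))
    return deltas


old_a = b"hello world, this is version one. " * 20
new_a = b"hello world, this is version two. " * 20
old_zip = make_zip({"a.txt": old_a, "b.txt": b"only in the old archive"})
new_zip = make_zip({"a.txt": new_a, "c.txt": b"only in the new archive"})

deltas = diff_archives(old_zip, new_zip)
print(sorted(deltas), len(deltas["a.txt"]), "bytes of delta for", len(new_a), "bytes")
```

Members that exist in only one archive are skipped here; a real tool would need its own convention for added and removed files.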

I imagine xdelta3 can't effectively create diffs for data too big to go into memory anyway.

i30817 commented 6 years ago

Well, I was more worried about the case where a file needs two patches applied, one to the earlier half, the other to the later half. I'm using generators that produce two streams of bytes from the archives I want to delta, and I create patches based on that. On the decoder end I read those bytes for the source and apply the patches.

Currently I'm assuming I won't have a file over 64 MB and that it doesn't matter if a patch applies to only two files (when it should be a sliding window of bytes instead). I should get to work on that, or investigate an alternative to tarfile that is usable as a stream on creation, without needing the whole file to add in memory.
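
The chunked scheme described above could be sketched roughly as follows. This is not the code from the gist below: the length-prefixed framing, the tiny CHUNK size, and the function names are all assumptions made for illustration, and toy_encode / toy_decode are a zlib-based stand-in for a real delta codec.

```python
import io
import struct
import zlib
from itertools import zip_longest

CHUNK = 16  # deliberately tiny for the demonstration


def toy_encode(source: bytes, target: bytes) -> bytes:
    """Stand-in delta encoder (NOT xdelta3)."""
    comp = zlib.compressobj(level=9, zdict=source) if source else zlib.compressobj(level=9)
    return comp.compress(target) + comp.flush()


def toy_decode(source: bytes, delta: bytes) -> bytes:
    """Inverse of toy_encode."""
    decomp = zlib.decompressobj(zdict=source) if source else zlib.decompressobj()
    return decomp.decompress(delta) + decomp.flush()


def chunks(stream, size=CHUNK):
    """Yield successive blocks of at most `size` bytes from a binary stream."""
    while True:
        block = stream.read(size)
        if not block:
            return
        yield block


def write_patch(source, target, out):
    """Pair chunks by position and write each delta with a 4-byte length prefix."""
    for src, tgt in zip_longest(chunks(source), chunks(target), fillvalue=b""):
        delta = toy_encode(src, tgt)
        out.write(struct.pack(">I", len(delta)))
        out.write(delta)


def apply_patch(source, patch, out):
    """Read framed deltas and apply each one to the matching source chunk."""
    src_chunks = chunks(source)
    while True:
        header = patch.read(4)
        if not header:
            return
        (size,) = struct.unpack(">I", header)
        out.write(toy_decode(next(src_chunks, b""), patch.read(size)))


source_data = b"abcdefghij" * 10
target_data = b"abcdefghij" * 5 + b"XYZ" + b"abcdefghij" * 7

patch = io.BytesIO()
write_patch(io.BytesIO(source_data), io.BytesIO(target_data), patch)

patch.seek(0)
rebuilt = io.BytesIO()
apply_patch(io.BytesIO(source_data), patch, rebuilt)
print(rebuilt.getvalue() == target_data)
```

Because chunks are paired by position, an insertion early in the target shifts every later pair out of alignment, so the deltas are likely to be larger than a whole-file diff would be. That is the limitation the sliding-window remark above refers to.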

https://gist.github.com/i30817/06e5f18ac39d1d1c1765338cc5631139

samuelcolvin commented 6 years ago

I think that will never work because I imagine xdelta3 needs the whole byte array to work on at the same time.

On 19 Jan 2018 22:51, "i30817" notifications@github.com wrote:

Well, i was more worried about the case where a file needs two patches applied, one to the earlier half, the other to the later. But i'm guessing this is actually silly and i should just bite the bullet and convert my code to use a tarfile (if only it had a sensible way to create it in memory and consume it at the same time).


i30817 commented 6 years ago

I just tried with a memoryview; it needs the equivalent of a clone, doubling the memory requirements: TypeError: argument 1 must be read-only bytes-like object, not memoryview
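
One possible workaround for the copy described above, sketched under the assumption that the called function insists on a real bytes object: return the underlying object when the view covers all of an immutable bytes value, and copy only otherwise. The helper name as_bytes is hypothetical and not part of any library mentioned here.

```python
def as_bytes(view: memoryview) -> bytes:
    """Return bytes for a memoryview, copying only when unavoidable."""
    base = view.obj
    # A C-contiguous view whose size equals the whole underlying bytes object
    # must start at offset 0 and cover it completely, so reuse it as-is.
    if isinstance(base, bytes) and view.c_contiguous and view.nbytes == len(base):
        return base
    # Slices, and views over mutable buffers such as bytearray, are copied.
    return view.tobytes()


data = b"0123456789"
print(as_bytes(memoryview(data)) is data)    # whole view: no copy
print(as_bytes(memoryview(data)[2:5]))       # slice: copied
```

This avoids the extra copy only in the whole-object case; a slice of a large buffer would still be duplicated.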

i30817 commented 6 years ago

BTW, that support would be a nice feature. Searching on Google, I see a lot of Python libraries using native code with this problem, e.g.: https://github.com/indygreg/python-zstandard/issues/26

i30817 commented 6 years ago

I made my utility work. It takes a little more memory than ideal due to the memoryview conversion, and I'm sad it doesn't work on Windows (this library would need a version for it, and I don't see Windows listed on the PyPI page), but this is OK.

samuelcolvin commented 6 years ago

Glad you got it to work.

Happy to accept a PR for Windows support.
