space-wizards / bsdiff-rs

Binary diffing algorithm implemented in Rust
Other
39 stars 11 forks source link

Binary format documentation #6

Open nyurik opened 9 months ago

nyurik commented 9 months ago

Hi, I am thinking of using this crate for https://github.com/maplibre/martin/issues/1049 - storing binary diffs for map tile distribution. This is a widely used format with multiple implementations in different languages, so creating a diff may be done with one tool, and patching the diff (in theory) could be done with another. Additionally, it would have to be versioned somehow, so that if the diff language changes in some way, we could track and avoid messing up the data. Thus, I would need to reference a definitive documentation on the diff format specifics, i.e. the "diff language". Is there such a documentation that you know of? Should it be created?

P.S. I will be happy to contribute to your project as well. Thx!

kornelski commented 9 months ago

This crate uses a bit different format than the popular tool of the same name.

There isn't a proper doc for it, but the code is simple, so it shouldn't be hard to make one. PRs welcome.

nyurik commented 5 months ago

@kornelski thx, took me a bit of time to process :) Do you think it would be possible to generate the same patching format as the output of your crate? From my understanding, it is totally OK for different libs to use different methods of producing patches, but the actual patching format should be relatively easy to make standard... unless I am missing some aspects that require patch format itself to be different?

kornelski commented 5 months ago

The patch format is just 3 integers and data, repeated. You can standardize around it if you want. You can make alternative patch implementations if you want.

The main difficulty is in finding appropriately similar patches, which uses a particularly clever sorting algorithm, described in the original bsdiff paper.

The original bsdiff tool also adds bz2 compression on top of the patch format. This tool leaves it up to the user to apply further compression, because bz2 is slow and there are newer algorithms to choose from.

nyurik commented 5 months ago

@kornelski exactly! So would it be possible to generate the patch with your lib, but apply it with the standard bsdiff? That would instantly make it standard again -- simply because the standard cannot dictate HOW the patch is created, only the format of the patch itself. Thus, given two files, A and B, the patch between them can be different (depending on which tool was used to create it), but applying a patch generated by any tool should produce identical B, no matter which tool is used to actually apply it.

Do you know if the original bsdiff standardized those 3 integers + data format, e.g. the endness, perhaps 7bit encoding of those integers, a patch version at the beginning, CRC or size expectations, any other aspects?

kornelski commented 5 months ago

No, it is not compatible with the standard bsdiff.

nyurik commented 5 months ago

I just discovered another similar rust crate - https://github.com/hucsmn/qbsdiff -- it promises to be bsdiff 4.x compatible - I wonder if these two efforts can be combined? This way we all can have the best of both worlds - good algorithm for creating patches, and cross-compatibility between various implementations. Would also be great to have some benchmarks on different approaches (without compression). Regardless, thanks for all the hard work on this!