hucsmn / qbsdiff

fast and memory saving bsdiff 4.x compatible delta compressor and patcher
MIT License
27 stars 6 forks source link

Allow custom differ and patcher format implementations #11

Open nyurik opened 3 months ago

nyurik commented 3 months ago

bsdiff format uses two somewhat unrelated concepts that fit many, but not all use cases: diffing algorithm and patch storage format.

There are multiple diffing algorithms which could be fine-tuned depending on the data being diffed, resulting in multiple implementations. This is usually the complicated part of the library, requiring a lot of testing and profiling.

Patch storage is far simpler, but may be optimized with different compressions (e.g. brotli), or removal of the magic BSDIFF40 prefix (e.g. in case when there are millions of small diffs being stored in the same sqlite file). Patch storage format is not affected by how the patch was generated.

I would like to discuss how qbsdiff can allow user-implemented patch storage. This way I could choose to have my own compression without the magic prefix, and store the patch in my own envelope format, while qbsdiff would pass me the raw patch data or take patch data and apply it. What do you think?

hucsmn commented 3 months ago

Yes, making another envelope format to support customizable patch compression algorithm is reasonable.

But, I personally tend to just provide a fixed set of available compression algorithms in configuration, rather than user-implemented compression algorithm callbacks.

According to Colin Percival's paper of bsdiff, the constructed patch file size is highly relative to which compression algorithm is chosen.

The patch file is then constructed of three parts: First, a control file containing ADD and INSERT instructions; second, a ‘difference’ file, containing the bytewise differences of the approximate matches; and third, an ‘extra’ file, containing the bytes which were not part of an approximate match. ... While these three files together are slightly larger than the original target file, the control and difference files are highly compressible; in particular, bzip2 tends to perform remarkably well (probably due to the highly structured nature of these two files).

Theoretically, these three parts of bsdiff patch file could be compressed using different algoritms due to their different data characteristics:

Maybe some experiments on the state-of-art algorithms should be taken to update Colin Percival's conclusion in 20 years ago.

PS: There was another piece of art attempting to improve bsdiff algorithm with a brand-new patch format, in which multiple compression formats (brotli, zstd, etc.) were supported. (I have not investigated on it yet).