Consider using a binary format

Daniel15 commented 8 years ago

You might achieve better file sizes by using a binary format. Instead of using numbers and +/- for the metadata, have a bitfield for the combination of operation + offset + number of characters to remove - two bytes should be sufficient to encode all of that.

If you also encode the number of characters being added in the same way you're encoding the number of characters to remove, you'd no longer need a separator between each segment, reducing the file size even more.

gmetais commented 8 years ago

Yes, this is definitely something that I plan to work on. I found @wanderview's work after finishing sw-delta's: https://blog.wanderview.com/blog/2015/10/13/patching-resources-in-service-workers/

He built https://github.com/wanderview/ubsdiff, based on bsdiff. I'm not fluent in binary, but it looks pretty good.

The RFC 3229 standard also advertise vcdiff and diffe.

gmetais commented 8 years ago

Here you can find a sw-delta-node branch, where I tried to implement your idea with a 2 bytes bitfield for [operation + offset + number of characters to remove/add].

Here is a small benchmark where I compare gzipped size with 1) the current format, 2) the binary format I just wrote, 3) the ubsdiff format

Files	current	binary	ubsdiff
jQuery-v2.1.2-min → jQuery-v2.1.3-min	46	38	313
jQuery-v2.1.1-min → jQuery-v2.1.2-min	1797	1689	1925
angular-v1.4.4 → angular-v1.4.5	3575	3656	4032
angular-v1.4.4-min → angular-v1.4.5-min	5183	4880	5821
bootstrap-v3.3.4 → bootstrap-v3.3.5	1750	1851	2271
bootstrap-v3.3.4-min → bootstrap-v3.3.5-min	1659	1753	2264
TOTAL	14010	13867	16626

The binary format is not always smaller than the text format. In total, it's 1.02% smaller than the text format on this little benchmark.

We can safely eliminate ubsdiff, but I'm not sure if we should absolutely switch to the binary format, as the results is not as good as expected.

Actually I suspect the gzip compression to work pretty well with the current format, and to optimize the +/-/separator chars in a way that's nearly as good as a binary format. Gzip loves recurrent patterns.

Could you take a look at my implementation and tell me if you spot some possible improvements?

Daniel15 commented 8 years ago

Actually I suspect the gzip compression to work pretty well with the current format

Ahh, good catch! I didn't consider the fact that gzip would compress the diff pretty well.

You could have both algorithms - binary and text - and use the one that results in the smallest size. That seems like it'd provide the best of both worlds.

gmetais / sw-delta

Consider using a binary format #7