gmetais / sw-delta

An incremental cache for the web
GNU General Public License v2.0
1.1k stars 31 forks source link

Consider using a binary format #7

Open Daniel15 opened 8 years ago

Daniel15 commented 8 years ago

You might achieve better file sizes by using a binary format. Instead of using numbers and +/- for the metadata, have a bitfield for the combination of operation + offset + number of characters to remove - two bytes should be sufficient to encode all of that.

If you also encode the number of characters being added in the same way you're encoding the number of characters to remove, you'd no longer need a separator between each segment, reducing the file size even more.

gmetais commented 8 years ago

Yes, this is definitely something that I plan to work on. I found @wanderview's work after finishing sw-delta's: https://blog.wanderview.com/blog/2015/10/13/patching-resources-in-service-workers/

He built https://github.com/wanderview/ubsdiff, based on bsdiff. I'm not fluent in binary, but it looks pretty good.

The RFC 3229 standard also advertise vcdiff and diffe.

gmetais commented 8 years ago

Here you can find a sw-delta-node branch, where I tried to implement your idea with a 2 bytes bitfield for [operation + offset + number of characters to remove/add].

Here is a small benchmark where I compare gzipped size with 1) the current format, 2) the binary format I just wrote, 3) the ubsdiff format

Files current binary ubsdiff
jQuery-v2.1.2-min → jQuery-v2.1.3-min 46 38 313
jQuery-v2.1.1-min → jQuery-v2.1.2-min 1797 1689 1925
angular-v1.4.4 → angular-v1.4.5 3575 3656 4032
angular-v1.4.4-min → angular-v1.4.5-min 5183 4880 5821
bootstrap-v3.3.4 → bootstrap-v3.3.5 1750 1851 2271
bootstrap-v3.3.4-min → bootstrap-v3.3.5-min 1659 1753 2264
TOTAL 14010 13867 16626

The binary format is not always smaller than the text format. In total, it's 1.02% smaller than the text format on this little benchmark.

We can safely eliminate ubsdiff, but I'm not sure if we should absolutely switch to the binary format, as the results is not as good as expected.

Actually I suspect the gzip compression to work pretty well with the current format, and to optimize the +/-/separator chars in a way that's nearly as good as a binary format. Gzip loves recurrent patterns.

Could you take a look at my implementation and tell me if you spot some possible improvements?

Daniel15 commented 8 years ago

Actually I suspect the gzip compression to work pretty well with the current format

Ahh, good catch! I didn't consider the fact that gzip would compress the diff pretty well.

You could have both algorithms - binary and text - and use the one that results in the smallest size. That seems like it'd provide the best of both worlds.