rotemdan / lzutf8.js

A high-performance Javascript string compression library
MIT License

Compatible with gzip? #28

Open burtonator opened 3 years ago

burtonator commented 3 years ago

Are the compressed strings compatible with gzip? That would make interop easy, so that the raw data could be worked with using external tools.

rotemdan commented 3 years ago

lz-utf8 was mostly designed and developed in the summer of 2014, before asm.js (introduced in 2013) and the later WebAssembly (introduced in 2017) were widely supported by browsers. Back then JavaScript ports of gzip did exist, but they were very slow and impractical for most uses.

At the time I thought it was a good/"cool" idea to design a compression format that was simple and fast enough to run within the browser. Since it was designed only for strings (not general binary data), I thought it would be nice if it were compatible with UTF-8 strings: a plain UTF-8 string is itself a valid form of lz-utf8 data, so, for example, you could send plain UTF-8 strings from a server and decode them with the lz-utf8 decoder with no issues. Internally it differentiates itself from UTF-8 by modifying a single bit. In more detail:

A UTF-8 code point is encoded in one of the following forms:
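For reference, standard UTF-8 picks one of four byte patterns, selected by the high bits of the first byte. The sketch below is illustrative only (`utf8SequenceLength` is a hypothetical helper name, not part of lzutf8) and lists those patterns in its comments:

```javascript
// Standard UTF-8 byte patterns (x = payload bits):
//   0xxxxxxx                             1-byte sequence (ASCII)
//   110xxxxx 10xxxxxx                    2-byte sequence
//   1110xxxx 10xxxxxx 10xxxxxx           3-byte sequence
//   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4-byte sequence
//
// Hypothetical helper: given the first byte of a sequence, return how many
// bytes the sequence occupies, or 0 if the byte cannot start a sequence.
function utf8SequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((byte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((byte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0; // continuation byte (10xxxxxx) or invalid lead byte
}
```

Note that in every multi-byte form, each byte after the first has its top bit set.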

So in lz-utf8 I used these two non-conflicting patterns to distinguish its own sequences without breaking UTF-8 (see more details in the technical paper):
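The idea can be sketched as follows. This assumes the pointer layout given in the project's README, which is not reproduced in this thread: a two- or three-byte sequence that starts like a UTF-8 multi-byte lead byte but whose second byte has its top bit cleared. Treat both that layout and the helper name `looksLikePointer` as assumptions, not as the library's actual code:

```javascript
// Assumed lz-utf8 pointer layout (l = length bits, d = distance bits):
//   110lllll 0ddddddd            2-byte pointer
//   111lllll 0ddddddd dddddddd   3-byte pointer
// In valid UTF-8, a byte starting with 11 is always followed by a
// continuation byte (10xxxxxx), so a following byte with top bit 0
// can never occur there. That one bit is what tells the two apart.
function looksLikePointer(bytes, i) {
  const lead = bytes[i];
  const next = bytes[i + 1];
  if (next === undefined) return false; // no second byte to inspect
  const leadStartsWith11 = (lead & 0xc0) === 0xc0;
  const nextTopBitClear = (next & 0x80) === 0x00;
  return leadStartsWith11 && nextTopBitClear;
}
```

Under this assumption, no position in a valid UTF-8 string is ever mistaken for a pointer, which is why plain UTF-8 passes through a decoder unchanged.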

There's no compatibility with any other known format. It's a unique design. It also doesn't include or support any headers. There's zero overhead: a non-compressible string will have exactly the length of the original string. This also means it cannot be made compatible with any other format that has headers (like gzip).
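To illustrate the header point: every gzip stream begins with the two identification bytes 0x1f 0x8b (per RFC 1952), and a format with no header has nowhere to put them. A small sketch, with `hasGzipHeader` as a hypothetical helper name:

```javascript
// Hypothetical helper: check for the two-byte gzip identification
// header (ID1 = 0x1f, ID2 = 0x8b) defined in RFC 1952.
function hasGzipHeader(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// A plain UTF-8 string (which is also valid headerless lz-utf8 data,
// as described above) does not start with these bytes.
const plainBytes = new TextEncoder().encode("hello");
const plainHasHeader = hasGzipHeader(plainBytes); // false
```

External gzip tools rely on this header, so they would reject headerless data regardless of how the payload is encoded.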

Today with WebAssembly of course you could just run a port of zlib (or use some wrapper library) in the browser with near-native performance.

And of course the core lzutf8 could (probably "should" at this point) be rewritten in C/Rust and compiled to WebAssembly (to get native [possibly up to 10-20x] performance in the browser and on the server, plus proper command line tools), but I don't have the time or feel the incentive to do so.

So it's basically a bit of a "novelty" format that I made a while back because I thought it was somehow cool, and I'm not especially excited about developing it further, since that probably wouldn't give me any real benefit or value.

burtonator commented 3 years ago

Thanks. This was my thinking as well. I'm trying to find a gzip impl written in WebAssembly (ideally) that works with a web worker... haven't found one yet :-/

prantlf commented 2 months ago

Nowadays you can use the Compression Streams API both on pages and in web workers. However, you'll have to encode the binary output using some text-friendly encoding if you want to store the output as a string. So this library is still worth looking at, since it utilises all UTF-8 characters.
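A minimal sketch of that approach, assuming an environment that exposes `CompressionStream`, `DecompressionStream`, `Blob`, `Response`, `btoa` and `atob` as globals (current browsers, web workers, and recent Node.js versions). Base64 is used here only as one example of a text-friendly encoding, and the function names are hypothetical:

```javascript
// Compress a string with gzip and return the result as a Base64 string.
async function gzipToBase64(text) {
  // Blob encodes string parts as UTF-8; stream() exposes them as a ReadableStream.
  const compressed = new Blob([text]).stream().pipeThrough(new CompressionStream("gzip"));
  const bytes = new Uint8Array(await new Response(compressed).arrayBuffer());
  // btoa expects a "binary string" with one character per byte.
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Reverse the steps above: Base64 -> gzip bytes -> original string.
async function gunzipFromBase64(base64) {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  const decompressed = new Blob([bytes]).stream().pipeThrough(new DecompressionStream("gzip"));
  return await new Response(decompressed).text();
}
```

The Base64 step makes the stored string roughly a third larger than the compressed bytes, which is the cost of the text-friendly encoding mentioned above.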