pieroxy / lz-string

LZ-based compression algorithm for JavaScript
MIT License

New collaborators and future plans #113

Rycochet opened 6 years ago

Rycochet commented 6 years ago

So @JobLeonard, @rquadling, and I have recently become collaborators on here - which basically means that we'll be helping @pieroxy maintain this :-)

Future plans are still to be decided, but I'll put down a couple on here, and then others can contribute ideas and plans (link to issues etc where relevant).

Most importantly, this will always stay compatible with the other versions in different languages - though that might mean we can lead the pack in adding something new that isn't supported everywhere else (base91 support is one such idea - #90)

My ideas in no particular order of importance:

As I say, these are my own thoughts and haven't been discussed with the new team yet (which is precisely what this issue is for) - anything confirmed will get a tick, anything rejected will get ~crossed out~, and once there's been some discussion I'll edit this into more of a "team plans" list :-)

JobLeonard commented 6 years ago

Most importantly, this will always stay compatible with the other versions in different languages - though that might mean we can lead the pack in adding something new that isn't supported everywhere else

I think that the way to go here may be to have smaller specialised modules. Not everyone needs base91, but those who do probably don't care for other functionality.

I have a personal use-case where I only care about compressing/decompressing a Uint8Array. A specialised function (and "compression format") that does so directly leads to smaller code, performs a lot better (no conversions from typed array to plain array to JSON.stringify, and back), and has room for specialised compression tricks (we already know we only have 256 possible byte values).
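
For illustration, here is a minimal sketch of the round trip a specialised module would avoid (compressToUTF16/decompressFromUTF16 are the actual lz-string API; the byte-array juggling around them is the overhead in question):

    var LZString = require("lz-string");

    var bytes = new Uint8Array([1, 2, 3, 255]);

    // current workaround: typed array -> plain array -> JSON string -> compress
    var compressed = LZString.compressToUTF16(JSON.stringify(Array.from(bytes)));

    // ...and the same conversions in reverse to get the bytes back
    var restored = new Uint8Array(JSON.parse(LZString.decompressFromUTF16(compressed)));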

Rycochet commented 6 years ago

Given my preference for TypeScript and splitting things into modules, I'm a strong supporter of modules - though I'm not a fan of default exports. Edited my post to make it more explicit ;-)

JobLeonard commented 6 years ago

If we ever do make a breaking change to the compression format, I suggest we:

tophf commented 6 years ago

JavaScript doesn't use zero-terminated strings so it's not clear why this concept should be applied here unless it'll provide some benefits.

JobLeonard commented 6 years ago

It's easier to end the bitstream: the while-loop is simplified because you look for truthy values until you hit zero.
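
A quick sketch of the simplification being claimed (readNextToken and the token array are illustrative stand-ins, not actual lz-string internals):

    // with 0 as the terminator, the token itself can drive the loop condition
    const tokens = [1, 2, 1, 0];
    let i = 0;
    const readNextToken = () => tokens[i++];

    let token;
    while ((token = readNextToken())) { // loop ends the moment we hit 0
        console.log("handle token", token);
    }

With the current scheme, the loop body needs an explicit comparison against the terminator value of 2 before it can break.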

tophf commented 6 years ago

"Easier" should be quantified. Is that loop so overcomplicated currently that it obstructs further development? Slows down execution? I find this zero-terminated string idea really weird in JavaScript and probably error-prone when used in UTF16 strings due to obscure bugs in various browsers.

JobLeonard commented 6 years ago

It's not a real zero-terminated string, it's a zero-terminated stream. Currently it's a two-terminated stream; we check for the token value of 2. Basically, there's a special termination token no matter what. Given that we're stuck with that, I find the value zero an aesthetically more pleasing choice. I know it's a nit-pick, but this order:

0 - terminate
1 - 8-bit char
2 - 16-bit char

looks prettier to me than:

0 - 8-bit char
1 - 16-bit char
2 - terminate
JobLeonard commented 6 years ago

Another idea (when making breaking changes to the format) is adding a special "run length encoding" token.

If the same (sub)string is repeated a lot, using run-length compression can be a lot more efficient. It's not that complicated to detect (basically: is the last token added also the one used for the next part), or to expand to a regular LZ-dictionary. Example:

    // pretend these go on forever
    const inputTrivial = "0000000000000000000000";
    const inputPattern = "abababababababababababababa";

For the zero-case, it's easy to see how to detect this:

| Input chars covered | Additions to dictionary | Repeats most recent token? |
| --- | --- | --- |
| 0 | 0 | no |
| 0 | 00 | yes |
| 000 | 0000 | yes |
| 0000 | 00000 | yes |
| 00000 | 000000 | yes |

Encoding the repeat count could be done with (for example) a 4-bit chunking scheme where the first three bits are the actual number and the last bit indicates whether another 4 bits are coming (so a value like 62 would be 11111100, as in 111 1 110 0; incidentally, that would be a string of 1954 zeros).
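
A sketch of that chunking scheme (illustrative only, not part of lz-string):

    // split n into 3-bit groups (most significant first) and append a
    // continuation bit to every group except the last
    function encodeRunLength(n) {
        const groups = [];
        do {
            groups.unshift(n & 0b111);
            n >>= 3;
        } while (n > 0);
        return groups
            .map((g, i) => g.toString(2).padStart(3, "0") + (i < groups.length - 1 ? "1" : "0"))
            .join("");
    }

    console.log(encodeRunLength(62)); // "11111100", i.e. 111 1 110 0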

As soon as we have a pattern like ababab..., this breaks down, which I guess is one of the reasons why simple LZ doesn't use run-length tokens:

| Input chars covered | Additions to dictionary | Repeats most recent token? |
| --- | --- | --- |
| a | a | no |
| b | b, ab | no |
| ab | ba | yes |
| ab | aba | no |
| aba | abab | yes |
| ba | bab | no |
| bab | baba | yes |
| abab | ababa | no |
| ababa | ababab | yes |
| baba | babab | no |

... however, if we use LZMW or LZAP - which I won't explain in detail here, but which essentially work by growing the dictionary with the full matched substring instead of one char at a time - patterns like these are trivial:

| Input chars covered | Additions to dictionary | Repeats most recent token? |
| --- | --- | --- |
| a | a | no |
| b | b, ab | yes |
| ab | bab | no |
| ab | abab (note: ab*2) | no |
| abab | ababab (ab*3) | yes |
| ababab | ababababab (ab*5) | yes |
| ababababab | abababababababab (ab*8) | yes |
| (ab*8) | (ab*13) | yes |
| (ab*13) | (ab*21) | yes |
| (ab*21) | (ab*34) | yes |

The logic is now "do I use the same two tokens in a row, and then repeatedly use the newest addition to the dictionary?" Super-easy to detect!
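
To make the difference in dictionary-update rules concrete, a toy sketch (illustrative only, not lz-string's actual implementation):

    // LZW:  add (previous match + first char of current match)
    function addToDictLZW(dict, prevMatch, curMatch) {
        dict[prevMatch + curMatch[0]] = Object.keys(dict).length;
    }

    // LZMW: add (previous match + the entire current match)
    function addToDictLZMW(dict, prevMatch, curMatch) {
        dict[prevMatch + curMatch] = Object.keys(dict).length;
    }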

Note that it also grows like the Fibonacci sequence. If we were to do the 62-repetition thing again, the encoded length would be ab times the sum of the first 62 Fibonacci numbers, which is pretty freaking huge.

Or 1.061e13.
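
A quick sanity check of that number, using the identity F(1) + ... + F(n) = F(n+2) - 1:

    let a = 1, b = 1, sum = 0;
    for (let i = 0; i < 62; i++) {
        sum += a;           // add F(i+1)
        [a, b] = [b, a + b];
    }
    console.log(sum); // 10610209857722, i.e. ~1.061e13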

When compressing sparse arrays (say, nearly-empty images with long runs of zeros) this can make a ridiculous difference in compression ratio. In every other situation the increased bit-length is negligible: we start with one extra special token, so the token bit-width increases one step earlier, meaning the increase is effectively at most one bit.

Rycochet commented 6 years ago

@JobLeonard ... Dude - that's an awesome thing to add - but should probably be in a separate issue before this one ends up like a certain PR lol

How about the other checkbox suggestions that we've got so far? Any comments to add to any of them?

For the "breaking changes" suggestions - I'd suggest adding them as future modules that are optionally used, instead of breaking what's already there. On a purely legal note - we would need to check the license requirements for any other techniques used and make sure there's a decent overview of how and why we can use them in the branch / PR that adds them.

Which also brings up another couple of things -

@pieroxy Do you want to keep control of releases, or do you want to create an npmjs.com organisation?

(Yes, I'm aware this is sort of project management, but it's early days and I can get lazy later lol)

JobLeonard commented 6 years ago

@JobLeonard ... Dude - that's an awesome thing to add - but should probably be in a separate issue before this one ends up like a certain PR lol

You're right, I made #114 for this. Maybe I should split them out into separate issues even.

JobLeonard commented 6 years ago

For the "breaking changes" suggestions - I'd suggest adding them as future modules that are optionally used, instead of breaking what's already there.

Totally agree. The backwards-compatibility of LZ-string is a great strength.

On a purely legal note - we would need to check the license requirements for any other techniques used and make sure there's a decent overview of how and why we can use them in the branch / PR that adds them.

I'm not really into software patents, so I hope we can figure that out easily.

JobLeonard commented 6 years ago

How about the other checkbox suggestions that we've got so far? Any comments to add to any of them?

I forgot to directly react to this before. Without saying anything about how to do this, I agree with each point in the first post so far.

I would also like to add:

This is not quite the same as just making things more modular, since that only talks about specialised modules. I mean the architecture of the compression algorithm as a whole.

For example: currently, after encoding a character to a token value, streaming the bits of that token to the output string is done inside the while loop of the _compress() function. If that were refactored into its own prototypical object that keeps track of the output-string state, with its own methods like stream(token, nrOfBits), it would make testing and updating the code easier.
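
A minimal sketch of what that could look like (BitStream and its method names are illustrative, not actual lz-string API; written pre-ES5 style to match the compatibility plans above):

    function BitStream(bitsPerChar, getCharFromInt) {
        this.bitsPerChar = bitsPerChar;
        this.getCharFromInt = getCharFromInt;
        this.value = 0;    // bits accumulated for the current output char
        this.position = 0; // how many bits of `value` are filled
        this.output = [];
    }

    // write a token LSB-first, flushing a char whenever bitsPerChar bits are full
    BitStream.prototype.stream = function (token, nrOfBits) {
        for (var i = 0; i < nrOfBits; i++) {
            this.value = (this.value << 1) | (token & 1);
            token >>= 1;
            if (++this.position === this.bitsPerChar) {
                this.output.push(this.getCharFromInt(this.value));
                this.value = 0;
                this.position = 0;
            }
        }
    };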

This goes together with the TS conversion, I guess

wintifrosch commented 6 years ago

Is it possible to add an asynchronous option? Like: LZString.async_compressToUTF16(<string>, process_callback)

Sometimes I need to compress large datasets in the background, without user interaction. Compressing >10MB leaves my client unresponsive for >5sec.

tophf commented 6 years ago

@wintifrosch, isn't that something that can be (and should be) solved via existing specialised solutions - a WebWorker, for example?

JobLeonard commented 6 years ago

Hmm, I can see why that's a pain...

Well, let's start by remembering that the plan is to keep the base lib pre-ES5. I'll repeat that because it complicates things a lot: the base lib will remain backwards compatible with pre-ECMAScript 5 environments. We would like to build a "modern" version too, though (a Python 2 / Python 3 kind of deal).

So... if you want something truly asynchronous, you need a WebWorker. Like tophf said, that sounds like a specialised kind of thing that we should not include as part of the library. Then again, if there is a really obvious best way to do it, we can consider a pull request for a separate JS file that implements this web worker solution.

If you need something "asynchronish" (single-threaded, but not UI-freezing), then some kind of hand-written "coroutine" that yields via window.setTimeout whenever it has been running for more than (say) 20ms could work.
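
A minimal sketch of that time-slicing idea (illustrative, not the gist's actual code; compressChunk is a hypothetical function that advances state.index and appends to state.output):

    function compressAsync(input, compressChunk, callback, maxMillis) {
        maxMillis = maxMillis || 20;
        // per-call state, so concurrent calls don't interfere
        var state = { index: 0, output: [] };
        function step() {
            var start = Date.now();
            while (state.index < input.length) {
                compressChunk(input, state);
                if (Date.now() - start > maxMillis) {
                    setTimeout(step, 0); // yield to keep the UI responsive
                    return;
                }
            }
            callback(state.output.join(""));
        }
        step();
    }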

Since I apparently am masochistic code-wise I'm already writing it... I don't think we'd include this in the main library either, but maybe as a separate lib?

wintifrosch commented 6 years ago

Thank you both for your speedy feedback. @tophf, a WebWorker is a perfect fit. It looks a bit hard for a beginner, but I'll look into it. @JobLeonard, looking forward to seeing your solution, though.

JobLeonard commented 6 years ago

I only did a test for each compression scheme with a single input and back, but this seems to work:

https://gist.github.com/JobLeonard/4bf109f038543a5ca1aabfdce3deab1b

Use at your own risk ;)

EDIT: fixed a stupid bug - due to the async nature, you can be compressing multiple strings at the same time. Hence, the string stream needs to be encapsulated per call.

Note that this is probably quite a bit slower than plain LZString in absolute terms, due to all the extra overhead from window.setTimeout and the added function call overhead. But hey, no browser freeze!

JobLeonard commented 6 years ago

Using the method outlined here I replaced window.setTimeout with a window.postMessage-based option that is a lot faster for smaller values of maxMillis (including the default of 20ms). See the updated gist.
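
For reference, the underlying trick is the well-known "setZeroTimeout" pattern (sketched here from memory, not copied from the gist): window.postMessage fires a message event asynchronously, but without setTimeout's clamped minimum delay.

    var queue = [];
    var MESSAGE_NAME = "zero-timeout"; // illustrative sentinel value

    window.addEventListener("message", function (event) {
        if (event.source === window && event.data === MESSAGE_NAME) {
            event.stopPropagation();
            if (queue.length > 0) queue.shift()();
        }
    }, true);

    function setZeroTimeout(fn) {
        queue.push(fn);
        window.postMessage(MESSAGE_NAME, "*");
    }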

EDIT: Of course, window.postMessage is not backwards compatible to pre-IE10, so in case you need that: here is the older window.setTimeout-version

@wintifrosch, I think this version works the way you want it

@tophf, do you need an unsafe variant? :P

tophf commented 6 years ago

do you need an unsafe variant?

No, we would use workers.

wintifrosch commented 6 years ago

@JobLeonard: LZStrAsync.compressToUTF16(string_to_compress, callback_function) works like a charm as a drop-in replacement. I'll use your code for the moment; it will take some weeks before I can look into the WebWorker alternative. BTW: while the duration of the sync version is very predictable, the duration of the async version varies substantially, as expected. The async compression takes 2-3 times longer (14s to 21s for 14 MB, instead of 7s for sync compression). Thanks again for your support!

JobLeonard commented 6 years ago

Which version did you try? And do you have control over the kind of data that you put in? If you can guarantee that the input alphabet is small (say, only latin characters plus a few symbols and emoticons) and that there won't be any full-UTF16 malicious inputs, I can also try to port the unsafe version for you over the weekend (I'll probably do it anyway).

wintifrosch commented 6 years ago

I used version 2.0 RC with window.postMessage(messageName, "*"); on line 67. -- BTW: I noticed no significant improvement from increasing maxMillis up to 300. Yes, I have a restricted dataset with controlled data: it is the JSON-formatted response of our API, containing objects from a client's CRM database. The content is regular text, which is always validated on modification.

pieroxy commented 6 years ago

It's great to see how this lib is getting attention again :-) Great work guys, I pretty much agree with everything here.

If you make a breaking change, just host two libs in this repo - LZString and LZString2 or LZStringNew, for example - so that there is no possible confusion.

As far as patents are concerned, LZW is out of troubled waters, and the rest came out of my brain, so there is no guarantee :-P

JobLeonard commented 5 years ago

@wintifrosch, I don't know if you ever got the Web Worker version to work, but here is a demo with an inline web worker: https://jsfiddle.net/vanderZwan/92oz4ep0/

The main trick is this added onmessage function:

  onmessage = function (e) {
      // the main thread sends {func, data}: the name of the LZString
      // method to call, and the string to call it on
      const {func, data} = e.data;
      let f;
      if (data && func && (f = LZString[func])) {
          // run the requested method and post the result back
          postMessage(f(data));
      }
  }

To send the LZWorker a string to compress, call LZWorker.postMessage({func: "<function name>", data: "<your data string to compress/decompress>"})

To handle the returned message, assign a callback function to LZWorker.onmessage.
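
A usage sketch following that protocol (the worker here is loaded from a separate file for brevity; the jsfiddle constructs it inline):

    // hypothetical setup: lz-worker.js defines LZString plus the
    // onmessage handler shown above
    const LZWorker = new Worker("lz-worker.js");

    LZWorker.onmessage = function (e) {
        console.log("compressed length:", e.data.length);
    };

    LZWorker.postMessage({
        func: "compressToUTF16",
        data: "hello hello hello"
    });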

The main downside of this approach is that any string you wish to compress will be duplicated for the worker, and then the compressed output will be copied when it is returned as well - so your example 14 MiB input will have another 14 MiB overhead + returned output. All of this will be GC'ed of course, but it creates extra allocation pressure and GC churn.

@tophf, are you using workers yet? ;)

tophf commented 5 years ago

Yes, I'm using a worker, and I use an async Proxy wrapper to invoke it. BTW, uBlock Origin apparently implemented LZString-based compression in a WASM module.

JobLeonard commented 5 years ago

No, he used his own version of LZW, which is a subtly different algorithm. It's cool though!
