rotemdan / lzutf8.js

A high-performance Javascript string compression library
MIT License

Reference examples of before/after compression? #34

Open ianfixes opened 3 years ago

ianfixes commented 3 years ago

I've made a good amount of progress porting this compression algorithm to Ruby. However, most of my test cases work -- not all of them. I'm borrowing some of them from the xunit-viewer NPM project, in which a compressed & b64-encoded string is used.

I'm running into sporadic errors in my implementation where the distance in a sized pointer sequence is greater than the length of the output string. Since this repo doesn't include hard copies of expected compressed data, I'm not sure how to troubleshoot.

Can you reply with a set of inputs and outputs, b64-encoded for absolute certainty of their contents? That would be a huge help.

rotemdan commented 3 years ago

I appreciate that you're interested in porting my work!

If I recall correctly, sized pointers beyond the window size (32k) are not possible (but my memory is quite blurry since this was designed many years ago).

The repository contains a suite of tests that uses various test cases in various languages (including Chinese, Hindi, etc.) and also includes a randomized test generator producing random-length UTF-8 strings. The input is compressed, then decompressed, and checked for validity thousands of times. I don't think that short test inputs would suffice, since the most bug-prone aspect of the code is partial (streamed) encoding and decoding (which was a bit of a "hell" to write and test - I wouldn't want to redo that, or recommend it to anyone else).
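As a rough sketch, that round-trip check looks something like the following (assuming the library's documented compress()/decompress() functions with their default encodings; the helper name is made up):

import * as LZUTF8 from "lzutf8";

// Compress the input to a byte array (the default), decompress it back to a string,
// and verify the round trip preserved the original text.
function roundTripOk(input: string): boolean {
    const compressed = LZUTF8.compress(input);
    const restored = LZUTF8.decompress(compressed);
    return restored === input;
}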

If the code was ported to WebAssembly, there's a runtime called wasmer-ruby that could run it at native speeds from Ruby code.

To make a WebAssembly port, the core of the library doesn't necessarily need to be rewritten. There's a language called AssemblyScript that can take TypeScript-like code and compile it directly to WebAssembly.

With a WebAssembly port, all that's needed is to replace the native JavaScript core with the WebAssembly one and run the existing test suite; the core of the library would then be tested for all platforms (including Ruby).

Unfortunately that seems like quite a bit of work (I'd say at least a week). I can't put that amount of work any time soon. I might also decide to modify the algorithm while I'm working on it (or not, I'm not sure).

ianfixes commented 3 years ago

The Ruby implementation (non-optimized) of both compression and decompression fits in 100 lines, so there's no need for the workaround.

The issue I ran into is on this compressed input (b64-encoded) which appears to have been produced by lzutf8.js from this file:

PHRlc3RzdWl0ZSBlcnJvcnM9IjAiIGZhaWx1cmXGDWhvc3RuYW1lPSJtc2sxd3N0MDU2IiDGEnNlc3Np
b25zLsgJX2FwaS7QlNC+0YHRgtGD0L/QvccMjCDRjdC90LTQv9C+0LjQvdGC0LAgxQ270YPRh9C10L3Q
uNGPINGB0L/QuNGB0LrQsMQNtdGB0YHQuNC5IiBza2lwcGVk5QCW5QC4PSIxIiB0aeQAijEuMDQ3MDky
xhBzdGFtcD0iMjAyMC0wMi0xM1QxMjowMjo1MC42NzUyNDgiPgogICAg5QD/Y2FzZSBjbGFzc/8A1f8A
1f8A1fcA1cZ00J/RgNC+0LLQtdGAxSXQtNJv0Ljfb8tvy2JzdGF0dXM9InBhc3NlZPEBMT48c3lzdGVt
LW91dOcBFiFbQ0RBVEFbCkBzY2VuYXJpby5iZWdpbgogINCh0YbFaLDRgMRfOiD/ALv/ALv8ALvlAI3Q
mtC+0LPQtMQvodC+0LfEC+QAgMsr0Y/EdbvEB8ZSjNC35QCTsNGC0LXFGdGBINCw0YPFEMR9uNGE0Ljk
AK/RhtC40LXQuSDQstC+INCyxVaFxF245QC5tdC80LDRhSAuLi4g5gFCIGluIDAuODk5c+YAoZgg0J/V
dozET4XQvtC00LjRgiDQsiDQvNC+0LHQuMUtvdC+0LUg0L/lAUm70L7QtuYBBtC1IEF1cm9yYS1tYXJr
ZXTRezEyy3ue0YLFQ7DQsuQA6dC10YLRgeQBCrfQsMUZvuQA++0BHvgBcMgnxiHRguUAornHSuUCv7j6
AOvwAKUwMTjnAKWi6gHBkuUBwbLkAK/RidCw6gCtusRW0LjFeOQBduQBjOgDfyDnAN7mATAx0mowMMdq
mN9i0GJ0b2tl32DIYOsDC2VuZAot3wHfAdEB5QDAXV3nA4Av6wOSPC/oBKQ+CsYM5QWvPg==
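For reference, a minimal sketch of feeding this blob back through lzutf8.js itself (assuming the library's documented Base64 input encoding; compressedBase64 is a placeholder for the blob above):

import * as LZUTF8 from "lzutf8";

// compressedBase64 stands in for the base64 blob above.
declare const compressedBase64: string;

// Decode the base64 input and decompress it back to the original XML text.
const xml = LZUTF8.decompress(compressedBase64, {
    inputEncoding: "Base64",
    outputEncoding: "String"
});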

Trying to decompress it, I eventually run into a situation where, at input byte 568, I need to look back a distance of 1455 and come up short -- my output string is only of length 1384 at that point. This means I'm likely failing to inflate some data. Dumping the bytes exposes the issue in the input -- see the last line:

0000 0b00111100 literal - <
0001 0b01110100 literal - t
0002 0b01100101 literal - e
0003 0b01110011 literal - s
0004 0b01110100 literal - t
0005 0b01110011 literal - s
0006 0b01110101 literal - u
0007 0b01101001 literal - i
0008 0b01110100 literal - t
0009 0b01100101 literal - e
0010 0b00100000 literal -
0011 0b01100101 literal - e
0012 0b01110010 literal - r
0013 0b01110010 literal - r
0014 0b01101111 literal - o
0015 0b01110010 literal - r
0016 0b01110011 literal - s
0017 0b00111101 literal - =
0018 0b00100010 literal - "
0019 0b00110000 literal - 0
0020 0b00100010 literal - "
0021 0b00100000 literal -
0022 0b01100110 literal - f
0023 0b01100001 literal - a
0024 0b01101001 literal - i
0025 0b01101100 literal - l
0026 0b01110101 literal - u
0027 0b01110010 literal - r
0028 0b01100101 literal - e
0029 0b11000110_00001101 pointer l=6 d=13
0031 0b01101000 literal - h
0032 0b01101111 literal - o
0033 0b01110011 literal - s
0034 0b01110100 literal - t
0035 0b01101110 literal - n
0036 0b01100001 literal - a
0037 0b01101101 literal - m
0038 0b01100101 literal - e
0039 0b00111101 literal - =
0040 0b00100010 literal - "
0041 0b01101101 literal - m
0042 0b01110011 literal - s
0043 0b01101011 literal - k
0044 0b00110001 literal - 1
0045 0b01110111 literal - w
0046 0b01110011 literal - s
0047 0b01110100 literal - t
0048 0b00110000 literal - 0
0049 0b00110101 literal - 5
0050 0b00110110 literal - 6
0051 0b00100010 literal - "
0052 0b00100000 literal -
0053 0b11000110_00010010 pointer l=6 d=18
0055 0b01110011 literal - s
0056 0b01100101 literal - e
0057 0b01110011 literal - s
0058 0b01110011 literal - s
0059 0b01101001 literal - i
0060 0b01101111 literal - o
0061 0b01101110 literal - n
0062 0b01110011 literal - s
0063 0b00101110 literal - .
0064 0b11001000_00001001 pointer l=8 d=9
0066 0b01011111 literal - _
0067 0b01100001 literal - a
0068 0b01110000 literal - p
0069 0b01101001 literal - i
0070 0b00101110 literal - .
0071 0b11010000_10010100 literal - Д
0073 0b11010000_10111110 literal - о
0075 0b11010001_10000001 literal - с
0077 0b11010001_10000010 literal - т
0079 0b11010001_10000011 literal - у
0081 0b11010000_10111111 literal - п
0083 0b11010000_10111101 literal - н
0085 0b11000111_00001100 pointer l=7 d=12
0087 0b10001100_00100000_11010001_10001101 WHAT IS HAPPENING

Byte 87 is where it's working through Доступность. The 4 bytes 0b10001100_00100000_11010001_10001101 don't appear to be a valid unicode sequence, but neither do they form a valid "sized pointer".

Bytes  Sized pointer sequence        UTF-8 Codepoint sequence
1      n/a                           0xxxxxxx
2      110lllll 0ddddddd             110xxxxx 10xxxxxx
3      111lllll 0ddddddd dddddddd    1110xxxx 10xxxxxx 10xxxxxx
4      n/a                           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
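Read literally, the table lets a decoder tell the two cases apart from the first two bytes of a sequence. A minimal sketch of that reading (mine, not code from lzutf8.js; treating the 3-byte pointer's distance as big-endian is an assumption):

// Classify the bytes at `pos` using only the bit patterns in the table above.
function classify(bytes: Uint8Array, pos: number):
        { kind: "literal" } | { kind: "pointer"; length: number; distance: number; size: number } {
    const b0 = bytes[pos];
    if (b0 < 0b11000000) {
        return { kind: "literal" };                 // 0xxxxxxx: plain ASCII literal byte
    }
    const b1 = bytes[pos + 1];
    if ((b1 & 0b10000000) !== 0) {
        return { kind: "literal" };                 // next byte is 10xxxxxx: a UTF-8 codepoint, not a pointer
    }
    const length = b0 & 0b00011111;                 // the lllll bits
    if ((b0 & 0b00100000) === 0) {
        return { kind: "pointer", length, distance: b1, size: 2 };   // 110lllll 0ddddddd
    }
    const distance = (b1 << 8) | bytes[pos + 2];    // 111lllll 0ddddddd dddddddd (byte order assumed)
    return { kind: "pointer", length, distance, size: 3 };
}

Applied to the dump above, byte 0029 (0b11000110 0b00001101) comes out as a two-byte pointer with length 6 and distance 13, matching the trace.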

What's going wrong here? Am I trying to decompress invalid data, or failing to parse the input correctly?

rotemdan commented 3 years ago

The first thing I'd do is run the test function of the CLI tool on russian-unicode.xml:

node lzutf8-cli t russian-unicode.xml

Which returns:

Test result: *Passed* in 19.05ms

Since it passes it could be many things. I'm not sure exactly what.

There's a pointer on 0085 referring back to the previous substring ост which is of length 3 and has distance of 6.

I don't remember the exact details (this was designed and implemented about 6.5 years ago). If the decoded length and distance are wrong, maybe the bits are not being read with the right endianness?

Edit: now that I look at this again I ask myself why didn't I choose to use the distance to the end of the reference substring instead of its beginning? (I guess these are the kind of things I might want to rethink if I ever wanted to rewrite the library).

ianfixes commented 3 years ago

I did notice a few things as I implemented this. For example, you could automatically add 4 to the match length, since that's the minimum -- it would move the range to 4-35 instead of 0-31 (where 0-3 is invalid). And yes, having the distance point to the end of the matched string. Too late now, I suppose. Although you could reimplement your bucket strategy such that you don't artificially shorten from 64 to 32 entries -- you only need to throw out bucket entries when their position exceeds the maximum distance.

I noticed that ост is encoded as 6 bytes but the pointer requests 7 -- it's matching the first byte of ь. So my debugging function is just overreacting to finding what would be the second byte of ь. I'm going to permit that and start tracking total expected length to see if that unearths anything.

rotemdan commented 3 years ago

I actually remember considering having the minimal match length be 0, but I think I wanted to reserve 3, 2, 1 and 0 as potential special codes for some purpose (I don't exactly remember what it was).

As for the bucket strategy, I don't even remember what that was about. That custom hash table thing turned out to be a lot of work and very error-prone. I don't think that object is even used in modern browsers (it was designed as an alternative for browsers without Map objects - remember this was 2014). I believe today it always uses Map objects, and they probably give better performance.

This work was done so long ago, and a lot of the motivation has died down over the years. The library didn't get much attention. I was initially excited about the idea, but today it just looks like extra work that doesn't really yield anything. My main programming language at this moment is actually C++. I develop audio plugins (coincidentally, one of my current products is called a "compressor", but that's a completely different thing).

Since I've now become more fluent with C++, I could rewrite it in a much shorter time to get native speeds, but the question is: why? The idea may seem "cool", but there are standardized generic alternatives like gzip/brotli etc. that fit many use cases and provide better compression ratios (though possibly not as fast when optimized).

I guess there are some specialized use cases where this approach is preferable. I'm not sure if I even care anymore, to be honest. This work hasn't really rewarded me with anything overall.

ianfixes commented 3 years ago

It's not my preference to port this library either, if I'm honest... but I am trying to work with something else that has a dependency on it, and being able to perform the compression/decompression in another language is currently the shortest path. It's unlikely that the dependency on lzutf8 would/could be dropped from the other project.

I think I stumbled on the answer, but it's in the form of what might be an undocumented "feature" in your library. Based on your edit:

I ask myself why didn't I choose to use the distance to the end of the reference substring instead of its beginning

it occurred to me that using the distance to the beginning of the string means that the length can never be greater than the distance. But when I explicitly check for that, I find these lines:

0680 0b00101101 literal - -
0681 0b11011111_00000001 pointer l=31 d=1
0683 0b11011111_00000001 pointer l=31 d=1
0685 0b11010001_00000001 pointer l=17 d=1

These are the encoding of a line of dashes (1 + 31 + 31 + 17 of them, apparently). "Fetch me 31 characters starting 1 character ago". I assume this is a hidden feature where you just repeat the pointed-to string until you reach the target length. But is that how your implementation actually works? I could imagine getting tripped up if in other instances it repeats just the first or just the last byte. Does this ring any bells as a feature? It could also be a bug in the linear comparison step.

I'm going to code it as "repeats" for now (i.e., I don't think I'm blocked and you can probably even close this issue) unless you know a reason why that would be unsafe.

rotemdan commented 3 years ago

I honestly don't recall anything at the moment. After several minutes looking at the code I still can't really understand anything that's going on :).

I'm not sure if I had a feature for "repeat". I do remember something about a sophisticated feature for future-overlapping sequences.

I've tested several inputs like:

--------------------------------------------------------------------------------------------
ababababababababababababababababababababababababababababababababababab

And they had the sort of behavior you show (the CLI test command showed they were decompressed correctly).

I think what is happening is that the decompressor is incrementally decoding one byte at a time, such that each element of the referenced 31-character sequence is being read from the previously decompressed output buffer (not the input).

// Copy the match bytes to output
for (let offset = 0; offset < matchLength; offset++)
    this.outputByte(this.outputBuffer[matchPosition + offset]);

It starts with the first dash (plain literal character):

-

Now it decodes the second dash (first element of the substring):

--

Now it the third (second element of the substring):

---

And so on.
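In code, that behavior is roughly the following (a sketch of the mechanism described above, not the library's actual implementation):

// Copy `length` bytes starting `distance` bytes back from the end of the output.
// Because each byte is read from output that has already been written, a pointer
// with length > distance re-reads bytes produced earlier in the same copy.
function copyMatch(output: number[], distance: number, length: number): void {
    const matchPosition = output.length - distance;
    for (let offset = 0; offset < length; offset++) {
        output.push(output[matchPosition + offset]);
    }
}

const out: number[] = [0x2d];    // a single "-" literal
copyMatch(out, 1, 31);           // out is now 32 dashes, as with the l=31 d=1 pointers above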

I remember this algorithm had many subtle areas like this.

The most difficult part was dealing with truncated multibyte sequences (to support streaming decompression for chunks of arbitrary lengths). You may have noticed this function and related code in the decompressor:

private rollBackIfOutputBufferEndsWithATruncatedMultibyteSequence() {
  ....
}

That was very difficult to write and test. To ensure this worked correctly, I had to write a dedicated randomized test case generator to create arbitrary truncated multibyte chunks of random sizes (some as small as one byte).

This library ended up as an excruciating 2-3 month effort (full-time)! I did not expect it would take that long, but I managed to finish it. I wouldn't recommend this to anyone. The best thing I can recommend is to port the existing code, or at least use the existing test suite.

Edit: Now thinking about this again, maybe this was the reason I chose to use the distance to the start of the substring and not the end. I don't remember. It may make sense since allowing for future-overlapping sequences seems like an effective feature to improve compression rates (not sure if it is used in most other dictionary-based compression algorithms).

ianfixes commented 3 years ago

Fortunately for me, I'm not working with streams. But I did catch that subtlety.

I can see that you're copying the output iteratively whereas my first crack at the implementation was just extracting the array segment in one shot using length and distance. I'm still doing that, but just repeating it as necessary if length > distance.

Overall this is a clever algorithm with some implementation gotchas. But it's very interesting that decompression is idempotent -- you can give it compressed or uncompressed UTF-8 and it will handle it just fine. It made for an interesting weekend project :)

rotemdan commented 3 years ago

I'm not sure if I would describe the "overlapping" reference as exactly repetition, since it can have a size that is not an exact multiple of the length of the referenced subsequence:

Say the input is:

abcabcabd

It is compressed as:

0: a
1: b
2: c
3: reference with distance 3 (distance should be encoded as "2" I believe), size 5
4: d

(5 is not a multiple of 3)

Decoding goes like this:

abc
abca
|--|
abcab
 |--|
abcabc
  |--|
abcabca
   |--|
abcabcab
    |--|
abcabcabd

Edit: changed size from 8 to 5 (my mistake)
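A character-by-character sketch of that trace (hypothetical helper, not library code):

// Expand a pointer over the decoded text one character at a time, as in the trace above.
function expandPointer(decoded: string, distance: number, length: number): string {
    const start = decoded.length - distance;
    for (let i = 0; i < length; i++) {
        decoded += decoded[start + i];   // later iterations read characters appended by earlier ones
    }
    return decoded;
}

console.log(expandPointer("abc", 3, 5) + "d");   // "abcabcabd" -- 5 is not a multiple of 3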

ianfixes commented 3 years ago

Don't worry, I'm not exceeding the length.

This was my fix in Ruby:

-         output += output[-distance, length]
+         output += Enumerator.new { |y| loop { output[-distance, length].each { |v| y << v } } }.lazy.take(length).to_a
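An equivalent sketch of that fix in TypeScript terms (names made up): slice the last distance elements once, repeat the slice until length elements are covered, then truncate.

// Return the `length` bytes a pointer refers to, repeating the referenced segment
// as needed when length > distance, then trimming to the requested length.
function takeMatch(output: number[], distance: number, length: number): number[] {
    const segment = output.slice(output.length - distance);
    const repeated: number[] = [];
    while (repeated.length < length) {
        repeated.push(...segment);
    }
    return repeated.slice(0, length);
}

// usage: output = output.concat(takeMatch(output, distance, length));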