nodeca / pako

High-speed zlib port to JavaScript; works in browser & Node.js.
http://nodeca.github.io/pako/
MIT License

Switch to faster hashing for deflate #195

Closed: 101arrowz closed this 4 years ago

101arrowz commented 4 years ago

I saw what was mentioned in #135 and #136 and decided to try out the performance benchmarks for myself.

Selected samples: (1 of 1)
 > lorem_1mb

Sample: lorem_1mb.txt (1000205 bytes raw / ~257018 bytes compressed)
 > deflate-fflate x 29.27 ops/sec ±1.97% (52 runs sampled)
 > deflate-fflate-string x 25.85 ops/sec ±2.58% (47 runs sampled)
 > deflate-imaya x 7.41 ops/sec ±2.72% (22 runs sampled)
 > deflate-pako x 16.39 ops/sec ±1.44% (44 runs sampled)
 > deflate-pako-string x 15.05 ops/sec ±0.50% (41 runs sampled)
 > deflate-pako-untyped x 11.53 ops/sec ±1.06% (31 runs sampled)
 > deflate-uzip x 29.20 ops/sec ±2.41% (51 runs sampled)
 > deflate-zlib x 25.09 ops/sec ±0.41% (45 runs sampled)
 > inflate-fflate x 252 ops/sec ±1.07% (87 runs sampled)
 > inflate-fflate-string x 141 ops/sec ±0.75% (78 runs sampled)
 > inflate-imaya x 159 ops/sec ±1.37% (83 runs sampled)
 > inflate-pako x 193 ops/sec ±0.32% (87 runs sampled)
 > inflate-pako-string x 59.54 ops/sec ±0.31% (61 runs sampled)
 > inflate-pako-untyped x 65.32 ops/sec ±1.22% (64 runs sampled)
 > inflate-uzip x 274 ops/sec ±0.73% (86 runs sampled)
 > inflate-zlib x 589 ops/sec ±0.29% (93 runs sampled)

If you would like to experiment with the patches I made, I forked the repo: 101arrowz/pako.

Note the addition of fflate and uzip. uzip was mentioned in the previous issues but is, as you mentioned, slightly unstable. It hangs on bad input rather than throwing an error, cannot be configured much, cannot be used asynchronously or in parallel (multiple calls to inflate in separate callbacks, for example, cause it to fail), and, worst of all, does not support streaming.

I created fflate to resolve these issues while implementing some of the popular features requested here (e.g. ES6 modules). In addition, fflate adds asynchronous support with Web and Node workers (offloading to a separate thread entirely rather than just adding to the event loop) and ZIP support that is parallelized to easily beat out JSZip. The bundle size also ends up much smaller than pako's because of the much simpler code. And as the numbers above show, it is much faster than pako.

The purpose of this issue is not to boast about my library or to call pako a bad one. I believe pako can become much faster while remaining closer to the real zlib (i.e. the internal lib/ directory) than fflate or uzip are. The overhauled implementation in Node 12, as mentioned in #193, is the perfect chance to escape the trap of trying to match the C source nearly line for line and instead become a canonical but high-performance rework in JavaScript. For this reason, I suggest you try to optimize some of the slower parts of the library. I found an even better hashing function that produced very few collisions in my testing, and for larger files/image files it outperforms even Node.js's C zlib; if you'd like to discuss, please let me know.

Personally: thank you for your incredible contributions to the open source community. I use pako in multiple projects and greatly appreciate its stability, which is not yet present in libraries such as my own.

puzrin commented 4 years ago

Thanks for the info. Any fork or alternative is good; that's how open source moves things forward.

101arrowz commented 4 years ago

Would you be OK, then, with a PR that changes this line to force fewer hash collisions? Obviously this would differ from true zlib.
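
For context, the line in question is zlib's rolling hash update, which in pako amounts to the expression that appears later in this thread as HASH_ZLIB:

```js
// Rolling hash: shift the previous hash, XOR in the next input byte, and
// mask to the hash table size (hash_mask = 2^hash_bits - 1).
const HASH_ZLIB = (s, prev, data) => ((prev << s.hash_shift) ^ data) & s.hash_mask;
```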

puzrin commented 4 years ago

By default, I would prefer to keep the output binary-identical to zlib.

BUT! Probably this can be done via a global .enable_patches option (a switch between legacy/advanced behaviour)? Anyway, this suggestion is valuable and worth posting as a separate issue. Even if we can't figure out how to accept it immediately, it should not be lost.

I can't give a final answer without experimenting with benchmarks. At first glance, the chances are high.

puzrin commented 4 years ago

After digging around the internet a bit...

  1. Could you PR the initially proposed (not binary-compatible) changes? It will not be merged, but it will help significantly in investigating how to add .enable_patches.
  2. For gzip/ungzip the second weak spot is crc32 speed. This can be improved at least 2x with the slicing-by-4 algorithm (a sketch follows below): https://create.stephan-brumme.com/crc32/#slicing-by-8-overview
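
For reference, a minimal slicing-by-4 sketch in JavaScript, following the construction described at that link (standard reflected CRC-32 polynomial 0xEDB88320; names are illustrative, not pako's internals):

```js
// Build the four 256-entry tables: table 0 is the classic byte-at-a-time
// CRC-32 table; tables 1-3 extend each entry by one extra zero byte.
const CRC_TABLES = (() => {
  const t = [new Uint32Array(256), new Uint32Array(256),
             new Uint32Array(256), new Uint32Array(256)];
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    t[0][n] = c;
  }
  for (let n = 0; n < 256; n++) {
    t[1][n] = t[0][t[0][n] & 0xFF] ^ (t[0][n] >>> 8);
    t[2][n] = t[0][t[1][n] & 0xFF] ^ (t[1][n] >>> 8);
    t[3][n] = t[0][t[2][n] & 0xFF] ^ (t[2][n] >>> 8);
  }
  return t;
})();

// CRC-32 over a Uint8Array, consuming 4 bytes per main-loop iteration.
function crc32_slice4(buf) {
  let crc = 0xFFFFFFFF, i = 0;
  for (; i + 4 <= buf.length; i += 4) {
    // XOR in 4 input bytes (little-endian), then split the 32-bit value
    // across the four tables instead of looping byte by byte.
    const x = (crc ^ (buf[i] | (buf[i + 1] << 8) | (buf[i + 2] << 16) | (buf[i + 3] << 24))) >>> 0;
    crc = CRC_TABLES[3][x & 0xFF] ^ CRC_TABLES[2][(x >>> 8) & 0xFF] ^
          CRC_TABLES[1][(x >>> 16) & 0xFF] ^ CRC_TABLES[0][x >>> 24];
  }
  // Remaining 0-3 bytes: classic single-table update.
  for (; i < buf.length; i++) crc = (crc >>> 8) ^ CRC_TABLES[0][(crc ^ buf[i]) & 0xFF];
  return (crc ^ 0xFFFFFFFF) >>> 0;
}
```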
101arrowz commented 4 years ago

  1. I can do that and will submit a PR when I find time, maybe in a few days.
  2. I disagree with trying to optimize CRC32: the speed of GZIP is limited by DEFLATE, not CRC, if you run them in parallel. Optimizing CRC would yield maybe 1-2% faster overall speed.

puzrin commented 4 years ago

FYI:

 > deflate-pako x 10.08 ops/sec ±0.28% (28 runs sampled)
 > gzip-pako x 7.81 ops/sec ±0.62% (23 runs sampled)
 > inflate-pako x 105 ops/sec ±0.33% (77 runs sampled)
 > ungzip-pako x 81.71 ops/sec ±0.43% (68 runs sampled)

crc32 takes 20%, at least here. No idea why. But the only difference between deflate/gzip should be adler32 vs crc32.
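
(Back-of-envelope from the numbers above: 1/7.81 - 1/10.08 ≈ 0.029 s of extra time per call, i.e. roughly 22% of each 0.128 s gzip call, which is where the ~20% figure comes from.)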

101arrowz commented 4 years ago

That seems quite strange to me; fflate computes CRC32 with a standard single-byte lookup table, and it doesn't impact performance much (maybe 5%). I will investigate this too.

puzrin commented 4 years ago

Maybe JIT differences for ES5/ES6 syntax; they do strange things sometimes. For example, if you compare Node.js releases from v8 onward, there was a series of notable regressions in ES5.

Anyway, slicing-by-4 needs uint32-aligned access. The madness with multiple ArrayBuffer views (8-bit and 32-bit) is not worth the effort.

101arrowz commented 4 years ago

I've started to experiment with the codebase and will report back if I make progress. The hashing is not centralized, i.e. you set it in many different places, so I need to deep-dive into the code to tune its performance.

On another note, I saw you pushed some new commits for v2.0.0. If you're going to be releasing a new version, I'd recommend you consider a new idea I came up with: auto-workerization.

Typical workerization techniques do not allow you to reuse existing synchronous code in an asynchronous fashion, but I developed a function that can take a function (along with all of its dependencies), generate an inline worker string, and cache it for higher performance. It's been specifically optimized to work even after mangling, minification, and usage in any bundler. More importantly, you can reference other functions, classes, and static constants with this new method.

The main reason I suggest this is that very, very few people want to cause the environment (which could very well be a user's browser) to hang while running a CPU-intensive task such as deflate on the main thread. Offering a method to offload this to a worker automatically is incredibly useful for many less-experienced developers who don't understand workers and just want high performance and good UX.

You can see how I did it in fflate if you'd like to consider it.
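
For illustration, a minimal browser-only sketch of the approach (illustrative only; fflate's actual implementation also survives minification and supports Node workers):

```js
// Cache of object URLs so each workerized function's script is built once.
const workerCache = new Map();

// Turn `fn` plus its listed dependencies into an async function backed by an
// inline Web Worker. `deps` must be named function declarations, so this
// naive version breaks under minification (fflate solves that differently).
function workerize(fn, deps, key) {
  let url = workerCache.get(key);
  if (!url) {
    const src =
      deps.map(d => d.toString()).join('\n') +
      '\nconst __entry = ' + fn.toString() + ';' +
      '\nonmessage = e => postMessage(__entry(...e.data));';
    url = URL.createObjectURL(new Blob([src], { type: 'text/javascript' }));
    workerCache.set(key, url);
  }
  const worker = new Worker(url);
  return (...args) => new Promise((resolve, reject) => {
    worker.onmessage = e => resolve(e.data);
    worker.onerror = reject;
    worker.postMessage(args);
  });
}
```

Usage would look like `const deflateAsync = workerize(deflateSync, [dep1, dep2], 'deflate')`, where `deflateSync` and its deps are whatever synchronous functions the library already has (hypothetical names).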

puzrin commented 4 years ago

Workers are out of scope at this moment. This lib is still a zlib port by nature, a building block for upper abstraction layers. It would be different if my intent were to create just inflate/deflate, but I'd prefer a full-featured zlib port. Anyway, it's still possible to bump the version again :)

I'm familiar with workerization; pica uses it actively. Browserify has the webworkify package, which does all the messy stuff. Unfortunately, nothing like that is available for Rollup & webpack. If you know a transparent solution for extracting a function's source with all its dependencies, that would be interesting.

I've pushed a temporary es6 branch for your info. There are just syntax changes; inflate & tests are still not covered.

The last unpleasant problem is the lack of the original zlib 1.2.9 to binary-compare deflate output against. Node.js switched to another lib after v12.16. I still haven't decided how to solve this. Ideally, it would be nice to have a docker image to build a light WebAssembly wrapper for testing purposes.

> The hashing is not centralized, i.e. you set it in many different places, so I need to deep-dive into the code to tune its performance.

It's just an unrolled macro, easy to search for. I'll be OK with any form of PR, even a non-working example in a single place with the comment "spread to all such entries".

PS. The performance of crc32 fixed itself at some moment :)

101arrowz commented 4 years ago

To answer your question: the current system requires you to pass a dependency array, but beyond that all workerization is handled for you. This is done at runtime and supports any environment for that reason. It's still not possible to workerize without a dependency list, which is why the library author must provide the async form with my solution. The benefit over existing solutions is that it works in any bundler or environment and has no impact on bundle size.

The change in the Node 12 zlib binary is precisely why I made this issue; shifting away from Node.js parity can be justified now.

I checked again and you're right, it's simply a macro. I may move this to a function instead. By the way, my solution relies on memLevel and requires between 4kB and 16MB of extra memory (64kB for most users).

puzrin commented 4 years ago

> To answer your question: the current system requires you to pass a dependency array, but beyond that all workerization is handled for you. This is done at runtime and supports any environment for that reason. It's still not possible to workerize without a dependency list, which is why the library author must provide the async form with my solution.

Got it. Nice idea.

puzrin commented 4 years ago

I've landed all changes to master and moved the hash to a separate function: https://github.com/nodeca/pako/commit/677ba487ed68f072fba782b5bf55e8b1007357f4.

If I switch to an alternate hasher, the result is strange: deflate is faster, but inflate (of the "improved" deflate output) is slower:

HASH_ZLIB:

Sample: lorem_1mb.txt (1000205 bytes raw / ~257018 bytes compressed)
 > deflate-imaya x 4.94 ops/sec ±3.07% (16 runs sampled)
 > deflate-pako x 10.00 ops/sec ±0.22% (29 runs sampled)
 > deflate-zlib x 18.68 ops/sec ±1.31% (48 runs sampled)
 > gzip-pako x 10.09 ops/sec ±0.32% (28 runs sampled)
 > inflate-imaya x 106 ops/sec ±1.04% (76 runs sampled)
 > inflate-pako x 135 ops/sec ±0.85% (83 runs sampled)
 > inflate-zlib x 400 ops/sec ±0.56% (87 runs sampled)
 > ungzip-pako x 117 ops/sec ±0.47% (81 runs sampled)

HASH_FAST:

Sample: lorem_1mb.txt (1000205 bytes raw / ~536897 bytes compressed)
 > deflate-imaya x 4.95 ops/sec ±4.05% (16 runs sampled)
 > deflate-pako x 15.60 ops/sec ±0.23% (41 runs sampled)
 > deflate-zlib x 18.54 ops/sec ±0.20% (49 runs sampled)
 > gzip-pako x 15.33 ops/sec ±0.64% (41 runs sampled)
 > inflate-imaya x 57.95 ops/sec ±0.46% (72 runs sampled)
 > inflate-pako x 89.50 ops/sec ±0.81% (78 runs sampled)
 > inflate-zlib x 248 ops/sec ±0.53% (86 runs sampled)
 > ungzip-pako x 84.95 ops/sec ±1.98% (71 runs sampled)

Any ideas?

puzrin commented 4 years ago

All of the above is with the default memLevel (8), i.e. a 32K hash table (not 64K).

memLevel 9 improves deflate speed from 15 to 16 ops/sec (not much), so I left the default to simplify investigation.

101arrowz commented 4 years ago

Unfortunately, I've not had much time this week to spend on this PR, but I'll get to it in a day or two (the weekend).

The hash function you referred to in that code is suboptimal. UZIP uses it, but as it turns out, UZIP is much slower than fflate for large files. What I found most effective was XORing the three bytes at equally spaced shifts, letting the mask truncate the third byte if necessary.

See this snippet for what I used.
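
Roughly, it is a direct 3-byte hash of this shape (an illustrative reconstruction, not necessarily the exact snippet linked above):

```js
// Hash the 3-byte window at position i directly: XOR the bytes at equally
// spaced shifts and let the mask truncate the top of the third byte.
const hash3 = (dat, i, mask) =>
  (dat[i] ^ (dat[i + 1] << 4) ^ (dat[i + 2] << 8)) & mask;
```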

As for why inflate becomes slower, I'm not sure. I'll look into it.

puzrin commented 4 years ago

Hmm, I don't understand how to express that hash via the "standard" signature: https://github.com/nodeca/pako/blob/677ba487ed68f072fba782b5bf55e8b1007357f4/lib/zlib/deflate.js#L100

Anyway, take a look at the inflate speed after YOUR compression; that will be interesting to know.

BTW, AFAIK the Cloudflare zlib fork switched to a crc32-based hash with guaranteed result distribution.

puzrin commented 4 years ago

Quick-checked

101arrowz commented 4 years ago

For mine you need access to the three bytes at ind, ind + 1, and ind + 2. prev can be the byte at ind, data can be the byte at ind + 1, and you need a new parameter for the byte at ind + 2.
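
Hypothetically, the "standard" signature could carry all three bytes with one extra parameter (illustrative sketch only):

```js
// prev and data are reinterpreted as the raw bytes at ind and ind + 1;
// next is the new parameter carrying the byte at ind + 2.
const HASH_3BYTE = (s, prev, data, next) =>
  (prev ^ (data << 4) ^ (next << 8)) & s.hash_mask;
```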

I'm really confused as to how your inflate performance is hurt that much. I'll definitely look into it.

101arrowz commented 4 years ago

I've been trying to add my hash variant, but the performance is somehow much worse (2x slower). There may be something I'm doing wrong; all I changed was adding an extra parameter for an extra byte, which I access in each call to HASH. I noticed the hashing uses the previous hash rather than the previous byte, which could be an issue. I'll continue to look into it.

puzrin commented 4 years ago

What about inflate speed?

https://github.com/nodeca/pako/blob/master/benchmark/benchmark.js#L49 - replace here pako output with your output.

puzrin commented 4 years ago

I found the reason why inflate speed decreases: the compression ratio is much worse (bigger inflate input => slower).

let HASH_ZLIB = (s, prev, data) => ((prev << s.hash_shift) ^ data) & s.hash_mask;

pako deflate level 6, memLevel 8 => 257018 bytes

let HASH_FAST2 = (s, prev, data) => crc16(prev, data) & s.hash_mask;

pako deflate level 6, memLevel 8 => 536479 bytes

let HASH_FAST = (s, prev, data) => ((prev << 8) + (prev >> 8) + (data << 4)) & s.hash_mask;

pako deflate level 6, memLevel 8 => 536897 bytes

Increasing memLevel to 9 (for a 64k hash table) does not help.
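
(A quick check of the magnitude: 536897 / 257018 ≈ 2.1, so HASH_FAST hands inflate about twice as much input; the observed slowdown, 135 => 89.5 ops/sec or about 1.5x, is smaller than 2.1x because the decompressed output, and hence the output copy work, is the same ~1 MB in both runs.)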

101arrowz commented 4 years ago

I'm still working on this; I've just been struggling with implementing the actual hash function. Could you find a way to get three bytes to the hasher instead of just the previous hash and the current byte?

puzrin commented 3 years ago

AFAIK hash = 2 bytes. To be honest, I don't understand where the concept of "3 bytes" comes from. Usually hashes for sequences are "recursive".

101arrowz commented 3 years ago

The 3 bytes come from the fact that the minimum length of an LZ77 match in DEFLATE is 3 bytes. Therefore, it's better to ensure all three bytes are unique than just two. It's true the recursive hash works for this as well, but it just seems to be less accurate in my testing.
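
For what it's worth, expanding the zlib recurrence over one 3-byte window (memLevel 8, so hash_shift = 5 and a 15-bit mask) gives h = ((b0 << 10) ^ (b1 << 5) ^ b2) & 0x7FFF: all three bytes do participate, but b0 loses its top 3 bits and adjacent bytes overlap by 3 bits each, which is where the extra collisions come from.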

puzrin commented 3 years ago

> Therefore, it's better to ensure all three bytes are unique than just two.

I think that's not too critical, because:

I mean, your variant should be a bit better, but it would require significant code rework for this concrete package. IMO the "classic" recursive hash would be a more optimal starting point.

101arrowz commented 3 years ago

I've been looking into optimizing the function with a rolling hash, but I'm not sure how to get it done. The only thing I can conceive of is capping the previous hash at 2 * memLevel bits, then appending memLevel bits to the end for each new byte. Would that work?
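
For reference, zlib's existing scheme is already close to that: hash_bits = memLevel + 7 (15 at memLevel 8) and hash_shift = ceil(hash_bits / MIN_MATCH) = 5, so each byte is shifted out of hash_mask after exactly MIN_MATCH = 3 updates; effectively the hash is capped at hash_bits and rolls over the last 3 bytes.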

EDIT: After much experimentation, the original hash seems about as fast as it gets given the constraints. So this PR is dead until a rework is possible.