Closed: 101arrowz closed this issue 4 years ago
Thanks for the info. Any fork or alternative is good - that's how open source moves things forward.
Would you be OK, then, with a PR that changes this line to force fewer hash collisions? Obviously this would differ from true Zlib.
By default, I would prefer to keep the output binary-equal with zlib.
BUT! Probably this can be done via a global .enable_patches option (a switch between legacy/advanced behaviour)? Anyway, this suggestion is valuable and worth posting as a separate issue. Even if we cannot figure out how to accept it immediately, it should not be lost.
I cannot give a final answer without experimenting with benchmarks. At first glance, the chances are high.
After digging around the internet a bit...
FYI:
> deflate-pako x 10.08 ops/sec ±0.28% (28 runs sampled)
> gzip-pako x 7.81 ops/sec ±0.62% (23 runs sampled)
> inflate-pako x 105 ops/sec ±0.33% (77 runs sampled)
> ungzip-pako x 81.71 ops/sec ±0.43% (68 runs sampled)
crc32 takes 20%, at least here. No idea why, but the only difference between deflate and gzip should be adler32 vs crc32.
That seems quite strange to me; fflate runs CRC using a standard 1-byte (byte-at-a-time) table, and it does not impact performance much (maybe 5%). I will investigate this too.
Maybe JIT differences between ES5/ES6 syntax - they do strange things sometimes. For example, if you compare Node.js versions starting from v8, there was a series of notable regressions in ES5.
Anyway, slicing-by-4 needs uint32-aligned access, and the madness with multiple ArrayBuffer views (8-bit and 32-bit) is not worth the effort.
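For reference, a minimal byte-at-a-time (single 256-entry table) CRC32 in JavaScript - a generic illustration of the table-driven approach discussed above, not pako's or fflate's exact code:

```js
// Build the standard 256-entry CRC32 table once.
const CRC_TABLE = (() => {
  const t = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xEDB88320 ^ (c >>> 1) : c >>> 1;
    }
    t[n] = c;
  }
  return t;
})();

// Byte-at-a-time CRC32 over a Uint8Array; `crc` allows chaining.
function crc32(buf, crc = 0) {
  crc = ~crc;
  for (let i = 0; i < buf.length; i++) {
    crc = CRC_TABLE[(crc ^ buf[i]) & 0xFF] ^ (crc >>> 8);
  }
  return ~crc >>> 0;
}
```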
I've started to experiment with the codebase and will report back if I make progress. The hashing is not centralized, i.e. it is set in many different places, so I need to deep-dive into the code to tune its performance.
On another note, I saw you pushed some new commits for v2.0.0. If you're going to be releasing a new version, I'd recommend you consider a new idea I came up with: auto-workerization.
Typical workerization techniques do not allow you to reuse existing synchronous code in an asynchronous fashion, but I developed a function that can take a function (along with all of its dependencies), generate an inline worker string, and cache it for higher performance. It's been specifically optimized to work even after mangling, minification, and usage in any bundler. More importantly, you can reference other functions, classes, and static constants with this new method.
The main reason I suggest this is that very, very few people want to cause the environment (which could very well be a user's browser) to hang while running a CPU-intensive task, such as deflate, on the main thread. Offering a method to offload this to a worker automatically is incredibly useful for many less-experienced developers who don't understand workers and just want high performance and good UX.
You can see how I did it in fflate if you'd like to consider it.
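For readers unfamiliar with the general technique, here is a minimal sketch of runtime workerization via an inline Blob worker - an illustration of the idea described above, not fflate's actual implementation (all names are hypothetical):

```js
// Stringify a function plus its named dependencies into an inline Worker
// script, then return an async wrapper around it (browser environment).
function workerize(fn, deps = []) {
  const src =
    deps.map(d => `const ${d.name} = ${d.toString()};`).join('\n') +
    `\nonmessage = e => postMessage((${fn.toString()})(...e.data));`;
  const url = URL.createObjectURL(new Blob([src], { type: 'text/javascript' }));
  return (...args) =>
    new Promise((resolve, reject) => {
      const w = new Worker(url);
      w.onmessage = e => { resolve(e.data); w.terminate(); };
      w.onerror = reject;
      w.postMessage(args);
    });
}
```

A production version would also cache the generated worker and handle transferable objects, which this sketch omits.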
Workers are out of scope at this moment. This lib is still a zlib port by nature, a building block for upper abstraction layers. It would be different if my intent were to create just inflate/deflate... but I'd prefer a full-featured zlib port. Anyway, it's still possible to bump the version again :)
I'm familiar with workerization; pica uses it actively. Browserify has the webworkify package, which does all the messy stuff. Unfortunately, nothing like that is available for Rollup & webpack. If you know a transparent solution for extracting a function's source with all its dependencies, that would be interesting.
I've pushed a temporary es6 branch for your info. There are just syntax changes; inflate & tests are not covered yet.
The last unpleasant problem is the lack of the original zlib 1.2.9 for binary comparison of deflate output - Node.js switched to another lib after v12.16. I haven't decided how to solve this yet. Ideally, it would be nice to have a Docker image to build a light WebAssembly wrapper for testing purposes.
> The hashing is not centralized, i.e. it is set in many different places, so I need to deep-dive into the code to tune its performance.
It's just an unrolled macro, easy to search for. I'll be OK with any form of PR, even if it's a non-working example for a single place with a comment like "spread to all such entries".
PS: the crc32 performance fixed itself at some moment :)
To answer your question: the current system requires you to pass a dependency array, but beyond that all workerization is handled for you. This is done at runtime and supports any environment for that reason. It's still not possible to workerize without a dependency list, which is why the library author must provide the async form with my solution. The benefit over existing solutions is that it works in any bundler or environment and has no impact on bundle size.
The change in the Node 12 zlib binary is precisely why I made this issue; shifting away from Node.js parity can be justified now.
I checked again and you are right, it's simply a macro. I may move this to a function instead. By the way, my solution relies on memLevel and requires between 4kB and 16MB extra memory (for most users 64kB).
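As a hypothetical usage sketch of that dependency-array idea (reusing the illustrative workerize() from the earlier sketch in this thread; none of these names come from fflate):

```js
// A helper the hot function depends on; its source is inlined into the
// worker via the dependency list.
const double = x => x * 2;
// The synchronous function the library author wants to offload.
const sumDoubled = arr => arr.reduce((acc, x) => acc + double(x), 0);

// The author supplies the dependency list once; callers get an async form.
const sumDoubledAsync = workerize(sumDoubled, [double]);
sumDoubledAsync([1, 2, 3]).then(result => console.log(result)); // 12
```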
> To answer your question: the current system requires you to pass a dependency array, but beyond that all workerization is handled for you. This is done at runtime and supports any environment for that reason. It's still not possible to workerize without a dependency list, which is why the library author must provide the async form with my solution.
Got it. Nice idea.
I've landed all the changes to master and moved the hash to a separate function: https://github.com/nodeca/pako/commit/677ba487ed68f072fba782b5bf55e8b1007357f4.
If I switch to an alternate hasher, the result is strange - deflate is faster, but inflate (of the "improved deflate" output) is slower:
HASH_ZLIB:
Sample: lorem_1mb.txt (1000205 bytes raw / ~257018 bytes compressed)
> deflate-imaya x 4.94 ops/sec ±3.07% (16 runs sampled)
> deflate-pako x 10.00 ops/sec ±0.22% (29 runs sampled)
> deflate-zlib x 18.68 ops/sec ±1.31% (48 runs sampled)
> gzip-pako x 10.09 ops/sec ±0.32% (28 runs sampled)
> inflate-imaya x 106 ops/sec ±1.04% (76 runs sampled)
> inflate-pako x 135 ops/sec ±0.85% (83 runs sampled)
> inflate-zlib x 400 ops/sec ±0.56% (87 runs sampled)
> ungzip-pako x 117 ops/sec ±0.47% (81 runs sampled)
HASH_FAST:
Sample: lorem_1mb.txt (1000205 bytes raw / ~536897 bytes compressed)
> deflate-imaya x 4.95 ops/sec ±4.05% (16 runs sampled)
> deflate-pako x 15.60 ops/sec ±0.23% (41 runs sampled)
> deflate-zlib x 18.54 ops/sec ±0.20% (49 runs sampled)
> gzip-pako x 15.33 ops/sec ±0.64% (41 runs sampled)
> inflate-imaya x 57.95 ops/sec ±0.46% (72 runs sampled)
> inflate-pako x 89.50 ops/sec ±0.81% (78 runs sampled)
> inflate-zlib x 248 ops/sec ±0.53% (86 runs sampled)
> ungzip-pako x 84.95 ops/sec ±1.98% (71 runs sampled)
Any ideas?

All of the above is with the default memLevel (8) => hash table size 32K (not 64K). memLevel 9 improves deflate speed from 15 to 16 ops/sec (not much), so I left the default to simplify investigations.
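For reference, zlib (and pako) size the hash table from memLevel as hash_bits = memLevel + 7:

```js
// memLevel 8 (default): 1 << (8 + 7) = 32768 entries (32K)
// memLevel 9:           1 << (9 + 7) = 65536 entries (64K)
const hashSize = memLevel => 1 << (memLevel + 7);
```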
Unfortunately I've not had much time this week to spend on this PR, but I'll get to it in a day or two (weekend).
The hash function you referred to in that code is suboptimal. UZIP uses it, but as it turns out, UZIP is much slower for large files than fflate. What I found most effective was taking an equally-spaced XOR and cutting off the third byte if necessary.
See this snippet for what I used.
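Since the linked snippet is not reproduced here, a hedged sketch of the "equally spaced XOR" idea follows; the shift amounts and table size are illustrative, not fflate's exact constants:

```js
// Hash the three bytes that start a potential match by XORing them at
// evenly spaced shifts; the mask trims the top bits of the third byte.
const hashBits = 15;                          // e.g. a 32K-entry head table
const hashMask = (1 << hashBits) - 1;
const hash3 = (d, i) =>
  (d[i] ^ (d[i + 1] << 5) ^ (d[i + 2] << 10)) & hashMask;
```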
As for why inflate becomes slower, I'm not sure. I'll look into it.
Hmm, I don't understand how to express that hash via the "standard" signature: https://github.com/nodeca/pako/blob/677ba487ed68f072fba782b5bf55e8b1007357f4/lib/zlib/deflate.js#L100
Anyway, take a look at the inflate speed after YOUR compression - it will be interesting to know.
BTW, AFAIK the Cloudflare zlib fork switched to crc32 with a guaranteed result distribution.
Quick check:

let HASH_FAST2 = (s, prev, data) => crc32_single(prev, data) & s.hash_mask;

- not much faster (+10%), but it slows down inflate. Probably deflate is not fast enough because the function was exported.

let HASH_FAST = (s, prev, data) => ((prev << 8) + (prev >> 8) + (data << 4)) & s.hash_mask;

- faster (+50%), but slower inflate.

For mine you need access to three bytes at ind, ind + 1, and ind + 2. prev can be ind, data can be ind + 1, and you need a new parameter for ind + 2.
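A hedged sketch of how that could map onto pako's hash-helper signature (parameter names and shift constants are illustrative, not from the pako source):

```js
// Hypothetical three-byte variant: instead of the previous hash value,
// the helper receives the three bytes that start the candidate match.
const HASH_3BYTES = (s, b1, b2, b3) =>
  (b1 ^ (b2 << 5) ^ (b3 << 10)) & s.hash_mask;

// Call sites would read the bytes directly from the sliding window,
// e.g. HASH_3BYTES(s, s.window[str], s.window[str + 1], s.window[str + 2]).
```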
I'm really confused as to how your inflate performance is hurt that much. I'll definitely look into it.
I've been trying to add my hash variant, but the performance is somehow much worse (2x slower). There may be something I'm doing wrong; all I changed was adding an extra parameter for an extra byte, which I accessed in each call to HASH. I noticed the hashing uses the previous hash rather than the previous byte, which could be an issue. I'll continue to look into it.
What about inflate speed?
https://github.com/nodeca/pako/blob/master/benchmark/benchmark.js#L49 - replace the pako output here with your output.
I found the reason why inflate speed decreases - the compression ratio is much worse (bigger inflate input => slower inflate):
let HASH_ZLIB = (s, prev, data) => ((prev << s.hash_shift) ^ data) & s.hash_mask;
pako deflate level 6, memLevel 8 => 257018 bytes
let HASH_FAST2 = (s, prev, data) => crc16(prev, data) & s.hash_mask;
pako deflate level 6, memLevel 8 => 536479 bytes
let HASH_FAST = (s, prev, data) => ((prev << 8) + (prev >> 8) + (data << 4)) & s.hash_mask;
pako deflate level 6, memLevel 8 => 536897 bytes
Increase of memLevel to 9 (for a 64K hash table) does not help.
I'm still working on this; I've just been struggling with implementing the actual hash function. Could you find a way to get three bytes to the hasher instead of just the previous hash and the current byte?
AFAIK hash = 2 bytes. To be honest, I don't understand where the concept of "3 bytes" is taken from. Usually hashes for sequences are "recursive".
3 bytes is taken from the fact that the minimum length of an LZ77 sequence is 3 bytes in DEFLATE. Therefore, it's better to ensure all three bytes are unique than just two. It's true the recursive hash works as well for this, but it just seems to be less accurate from my testing.
> Therefore, it's better to ensure all three bytes are unique than just two.
I think that's not too critical. I mean, your variant should be a bit better, but it would require significant code rework for this concrete package. IMO the "classic" recursive hash would be a more optimal starting point.
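For context, the classic zlib recursive hash already ends up covering the three bytes of a minimal match: with hash_shift chosen so that 3 * hash_shift >= hash_bits, anything older than three bytes is shifted past the mask. A small sketch with the default 15-bit table:

```js
// zlib-style rolling hash: h = ((h << shift) ^ byte) & mask.
// With hashBits = 15 and shift = 5, a value depends only on the last
// three bytes fed in; older bytes have been shifted out of the mask.
const hashBits = 15, hashShift = 5, hashMask = (1 << hashBits) - 1;
const update = (h, byte) => ((h << hashShift) ^ byte) & hashMask;

// Hash covering the three bytes b0, b1, b2 of a candidate match:
const b0 = 0x6c, b1 = 0x6f, b2 = 0x72; // e.g. 'l', 'o', 'r'
let h = 0;
h = update(h, b0);
h = update(h, b1);
h = update(h, b2);
```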
I've been looking into optimizing the function with a rolling hash, but I'm not sure how to get it done. The only thing I can conceive is capping the previous hash at 2 * memLevel bits, then appending memLevel bits to the end for each new byte. Would that work?
EDIT: After much experimentation, the original hash seems about as fast as it gets given the constraints. So this PR is dead until a rework is possible.
I saw what was mentioned in #135 and #136 and decided to try out the performance benchmarks for myself.
If you would like to experiment with the patches I made, I forked the repo: 101arrowz/pako.
Note the addition of fflate and uzip.

uzip was mentioned in the previous issues but is, as you mentioned, slightly unstable. It hangs on bad input rather than throwing an error, cannot be configured much, cannot be used asynchronously or in parallel (multiple calls to inflate in separate callbacks, for example, cause it to fail), and, worst of all, does not support streaming.

I created fflate to resolve these issues while implementing some of the popular features requested here (e.g. ES6 modules). In addition, fflate adds asynchronous support with Web and Node workers (offloading to a separate thread entirely rather than just adding to the event loop) and ZIP support that is parallelized to easily beat out JSZip. The bundle size ends up much smaller than pako too because of the much simpler code. As you can see, it is much faster than pako.

The purpose of this issue is not to boast about my library or to call pako a bad one. I believe pako can become much faster while still being more similar than fflate or uzip to the real zlib (i.e. the internal lib/ directory). The overhauled implementation in Node 12 as mentioned in #193 is the perfect chance to escape the trap of trying to match the C source nearly line-for-line and instead become a canonical but high-performance rework in JavaScript. For this reason, I suggest you try to optimize some of the slower parts of the library. I found an even better hashing function that produced very few collisions in my testing, and for larger files/image files it outperforms even Node.js zlib in C; if you'd like to discuss, please let me know.

Personally: thank you for your incredible contributions to the open source community. I use pako in multiple projects and greatly appreciate its stability, not present in libraries such as my own.