Closed by H4ad 5 months ago
Also, there are very good options in the list: fflate, adm-zip, yauzl and zip.js.
What will be the goal of the package: to be fast? To use low memory? To be simpler to use? I think this is one of the first things to decide.
At least for me, I chose yauzl in the past because I needed to unzip large zip files inside an AWS Lambda, and I didn't find any other lib that was good enough for that use case.
node-unzip and unzipper are kind of the same: unzipping with low memory usage through an API based on readable streams. That looks good, but the API isn't that good because of the era in which it was built.
If we choose the name unzipr, maybe we will not be able to publish because of unzipper (very similar name).
We failed, but because of "unzip" :D
unzipper is broken in newer Node versions, and that is the primary reason why I started looking for a better alternative after we hit a bug in production.
yauzl sounds like a great starting option.
What will be the goal of the package: to be fast? To use low memory? To be simpler to use? I think this is one of the first things to decide.
My thought would be:
1) Support modern Node.js versions
2) Be reliable and stable
3) Be secure
4) Be as fast as points 1-3 allow
5) Consume as little memory as points 1-4 allow
6) Provide great developer experience
7) Have excellent code coverage
in that order.
I agree with all the points. About being fast, yauzl has great notes that we should learn from (I know nothing about the zip spec, so for me this is gold): https://github.com/thejoshwolfe/yauzl/blob/master/README.md#no-streaming-unzip-api
For the first step, I think we should start a rewrite of the code to use modern syntax (and TS) while keeping the same API, just to force us to understand the limitations and what the library does.
Then, we can think and study for a better API.
Agreed! What do you think about https://github.com/yarnpkg/berry/tree/master/packages/yarnpkg-libzip? It's probably even faster, but documentation is non-existent. I wonder if it already supports everything that we need, maybe we can wrap around it. Worth checking out at the very least.
Reading the source code, it looks like it loads everything into memory, but we can benchmark to see how it behaves (and whether it really uses a lot of memory).
I will create a benchmark to evaluate this case.
thanks!
Hey folks, busy day today but I'll be happy to help. I'll check back soon to see how I can contribute.
My 2 cents:
Last time I had to deal with zip was about 4 years ago. It was a PITA, because the zip format is kind of bad and always results in allocating huge memory blocks. gzip is simple and "stupid": it streams files one after another, so you can just scan forward through the file stream until you find the file you need. zip, on the other hand, has a central directory at the end of the zip file. So if you have no way to rewind the zip file (e.g. a file upload), you need to store the whole content in memory just to realize that the file you want to extract doesn't exist, and then release the memory.
I did not see any proper solution to this memory allocation issue. Maybe we should consider implementing solutions for such tasks. E.g. if we want to check whether a zip file contains a specific file, then maybe we should implement a special stream that just ignores all the data until it reaches the central directory, and then checks whether the file is listed there. A use case could be a file upload where you want to ensure that the zip file contains specific files.
Etc..
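To make the central-directory problem concrete, here is a minimal sketch (plain Node.js, no library) of locating the End of Central Directory record that zip places at the very end of the archive; `findEocd` and the returned field names are my own for illustration, not from any of the libraries discussed:

```javascript
// Sketch: the EOCD record sits at the END of a zip file, which is why a
// pure forward stream cannot know the file list up front.
function findEocd(buf) {
  // EOCD signature is 0x06054b50 ("PK\x05\x06"); scan backwards, because a
  // variable-length archive comment may follow the 22-byte record.
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) {
      return {
        entryCount: buf.readUInt16LE(i + 10),       // total number of entries
        centralDirSize: buf.readUInt32LE(i + 12),   // central directory size
        centralDirOffset: buf.readUInt32LE(i + 16), // where the directory starts
      };
    }
  }
  return null; // not a valid zip (or comment longer than we scanned)
}

// An empty zip archive is just a bare 22-byte EOCD record:
const emptyZip = Buffer.from('504b0506' + '00'.repeat(18), 'hex');
const eocd = findEocd(emptyZip);
console.log(eocd); // { entryCount: 0, centralDirSize: 0, centralDirOffset: 0 }
```

This also shows why the "ignore everything until the central directory" stream idea works: only these trailing bytes are needed to answer "is file X in the archive?".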
Also, when I used zip, I reused the "shape" of the tar or tar-stream (forgot which one) interfaces for the extraction and archiving streams. This was useful because it makes it possible to replace tar streams with zip and vice versa. So we should check which "stream" interface we want to use to increase compatibility (e.g. with zlib streams).
Here's some results:
Zip Size: 152,9 MB
Code: https://github.com/kibertoad/unzipomatic/tree/chore/benchmark/benchmark
yauzl:
Files Read: 20001
node bench-yauzl.mjs 0,24s user 0,22s system 186% cpu 0,249 total
avg shared (code): 0 KB
avg unshared (data/stack): 0 KB
total (sum): 0 KB
max memory: 49 MB
page faults from disk: 5
other page faults: 22358
yarnzip:
Files Read: 20001
node bench-yarnzip.mjs 0,24s user 0,18s system 135% cpu 0,307 total
avg shared (code): 0 KB
avg unshared (data/stack): 0 KB
total (sum): 0 KB
max memory: 358 MB
page faults from disk: 1
other page faults: 102460
In both libraries you need to manually read the file content. With yarnzip it is very easy (you just call readFileSync), but with yauzl it is very painful, since you need to open a stream and then read the whole stream yourself.
The first obvious thing we should improve is the API of yauzl, to use generators or callback/stream.
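As a sketch of what a generator-based API could look like (everything here is hypothetical, a design shape, not yauzl's real API), an async iterator over entries would let consumers use a plain `for await` loop:

```javascript
// Hypothetical API shape: an async generator yielding zip entries instead
// of yauzl's event/callback interface. In a real implementation each yield
// would come from parsing the central directory; here we iterate a
// preloaded list just to demonstrate the consumer-side ergonomics.
async function* entries(source) {
  for (const entry of source) {
    yield entry; // each entry would carry name, size, method, etc.
  }
}

const names = [];
for await (const entry of entries([{ name: 'a.txt' }, { name: 'b.txt' }])) {
  names.push(entry.name);
}
console.log(names); // [ 'a.txt', 'b.txt' ]
```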
Then, we should improve the usability of reading the file content: we can keep openReadStream, but we should try to add simpler/easier ways to read the content (returning a string or a buffer) for those who just need the file content.
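For the "read the content easily" part, the core building block is collecting a readable stream into a buffer. A helper like the one below (the name `streamToBuffer` is my own) could back convenience methods on top of yauzl's openReadStream:

```javascript
// Sketch of the convenience layer discussed above: yauzl hands back a
// Readable stream per entry, so a buffer/string API is just a matter of
// draining that stream.
import { Readable } from 'node:stream';

async function streamToBuffer(stream) {
  const chunks = [];
  for await (const chunk of stream) chunks.push(chunk); // Readable is async-iterable
  return Buffer.concat(chunks);
}

// Usage with any Readable; in the real library the stream would come from
// yauzl's openReadStream for a given entry:
const demo = Readable.from([Buffer.from('hello '), Buffer.from('zip')]);
const buf = await streamToBuffer(demo);
console.log(buf.toString()); // "hello zip"
```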
So you prefer yauzl?
@Uzlopak yeah, the library is very simple (one file, 800 lines), no wasm (wasm is not bad, but I have no experience using it), very low memory usage, and as fast as yarnzip.
I think we just need to improve the developer ux, the library itself is already very optimized.
One is faster, the other is more memory-efficient. We can improve the API of either.
@H4ad is "user" the only important metric?
@kibertoad What do you mean? I think yauzl can achieve the goals you mentioned and also won't be a huge challenge to port/refactor (yarnlib is very big in size/lines of code).
@H4ad in the benchmarks you shared, the total execution time is different, but the user execution time is the same. And yeah, native JS will definitely be easier to maintain and consume.
I agree that yauzl seems to be easier to grok. Maybe also easier to refactor and to optimize.
The user time is probably the time spent in JS; the system time is probably the time spent on syscalls (yarnzip will be faster on that because it reads the whole file into memory). I say "probably" because I never really thought about those metrics; I always look at the total time.
But the important metric is total time, which is ~0,25s for yauzl and ~0,3s for yarnzip.
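For reference, the same three numbers that `time` prints can be observed from inside Node: `process.cpuUsage()` reports user/system CPU time, while wall-clock ("total") time is just elapsed time. This is only an illustration of the metrics, not part of the benchmark code:

```javascript
// user/system CPU time vs wall-clock ("total") time, measured in-process.
import { performance } from 'node:perf_hooks';

const startCpu = process.cpuUsage();   // { user, system } in microseconds
const startWall = performance.now();   // wall clock, milliseconds

// Busy work on the JS side; this accumulates "user" CPU time:
let acc = 0;
for (let i = 0; i < 1e7; i++) acc += i;

const cpu = process.cpuUsage(startCpu);     // delta since startCpu
const wall = performance.now() - startWall; // elapsed wall-clock time
console.log(acc > 0 && cpu.user > 0 && wall > 0); // true
```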
yauzl is faster, uses less memory, and is also easier to maintain.
About yauzl, someone already tried to improve that library: https://github.com/overlookmotel/yauzl-promise
Should we create some PRs to yauzl-promise, or is it better to have our own version?
I think we want to add more APIs, not just port the current version, so I don't know if it's worth it.
I kind of don't like the codebase of yauzl-promise. I feel yauzl may be too simple, but I think with our expertise we would implement it differently.
How about a benchmark @H4ad ?
Benchmark for yauzl-promise:
Files Read: 20001
node bench-yauzl-promise.mjs 0,36s user 0,34s system 193% cpu 0,363 total
avg shared (code): 0 KB
avg unshared (data/stack): 0 KB
total (sum): 0 KB
max memory: 84 MB
page faults from disk: 1
other page faults: 33186
The slowdown is probably caused by the usage of generators, so it's not a huge problem.
The only thing I didn't understand is why it takes more memory.
@H4ad can you create a PR that puts the benchmarks in a subfolder? Useful thing to have.
Well, based on our discussion, I think yauzl will be our choice.
If so, then we just need to create a PR with the initial code.
I will create more issues to then address refactoring of code, refactoring of tests, and discussing new API (since we will probably avoid callbacks).
Yup, yauzl as a primary starting point makes sense; we can use the other libraries as points of reference and borrow from them if there is something valuable that could be reused.
I will mention this in the readme.