kibertoad / unzipomatic

Modern unzipping library for Node.js
MIT License

Choose a lib to start from #2

Closed H4ad closed 5 months ago

H4ad commented 5 months ago

Options:

If we choose the name unzipr, we may not be able to publish it because of unzipper (very similar name).

H4ad commented 5 months ago

Also, there are very good options in the list, fflate, adm-zip, yauzl and zip.js.

What will be the goal of the package: be fast? use low memory? be simpler to use? I think this is one of the first things to decide.

H4ad commented 5 months ago

At least for me, I chose yauzl in the past because I needed to unzip large zip files inside an AWS Lambda, and I couldn't find another lib that was good enough for that use case.

node-unzip and unzipper are roughly the same: they unzip with low memory usage, but the API is built on readable streams. It looks good at first, but the API isn't that great, given the era in which it was built.

kibertoad commented 5 months ago

If we choose the name unzipr, we may not be able to publish it because of unzipper (very similar name).

We failed, but because of "unzip" :D

kibertoad commented 5 months ago

unzipper is broken in newer Node versions, and it is the primary reason why I started looking for a better alternative after we hit a bug in production. yauzl sounds like a great starting option.

kibertoad commented 5 months ago

What will be the goal of the package: be fast? use low memory? be simpler to use? I think this is one of the first things to decide.

My thought would be:

1. Support modern Node.js versions
2. Be reliable and stable
3. Be secure
4. Be as fast as points 1-3 allow
5. Consume as little memory as points 1-4 allow
6. Provide great developer experience
7. Have excellent code coverage

in that order.

H4ad commented 5 months ago

I agree with all points. About being fast: yauzl makes some great points that we should learn from (I know nothing about the zip spec, so for me this is gold): https://github.com/thejoshwolfe/yauzl/blob/master/README.md#no-streaming-unzip-api

As a first step, I think we should start by rewriting the code to use modern syntax (and TS) while keeping the same API, just to force us to understand what the limitations are and what the library does.

Then, we can think and study for a better API.

kibertoad commented 5 months ago

Agreed! What do you think about https://github.com/yarnpkg/berry/tree/master/packages/yarnpkg-libzip? It's probably even faster, but documentation is non-existent. I wonder if it already supports everything that we need, maybe we can wrap around it. Worth checking out at the very least.

H4ad commented 5 months ago

Reading the source code, it looks like it loads everything into memory, but we can benchmark it to see how it performs (and whether it really uses a lot of memory).

I will create a benchmark to evaluate this case.

kibertoad commented 5 months ago

thanks!

DaniGuardiola commented 5 months ago

Hey folks, busy day today but I'll be happy to help. I'll check back soon to see how I can contribute.

Uzlopak commented 5 months ago

My 2 cents:

The last time I had to deal with zip was about 4 years ago. It was a PITA because the zip format is kind of bad and always results in allocating huge memory blocks. gzip is simple and "stupid", streaming files one after another, so you can just fast-forward through the file stream until you find the file you need. zip, on the other hand, has a central directory at the end of the zip file. So if you have no way to rewind the zip file (e.g. a file upload), you need to store the whole content in memory, only to realize that the file you want to extract doesn't exist, and then release the memory.

I did not see any proper solution to this memory allocation issue. Maybe we should consider implementing solutions for such tasks. E.g. if we want to check whether a zip file contains a specific file, we could implement a special stream that ignores all the data until it reaches the central directory, then checks whether the file is listed there. A use case could be a file upload where you want to ensure that the zip file contains specific files.
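To make the "central directory at the end" point concrete, here is a minimal sketch of locating the End Of Central Directory (EOCD) record. The EOCD starts with the signature `PK\x05\x06` (`0x06054b50` little-endian) and sits in the tail of the archive, which is why you must be able to seek to (or buffer) the end of the file. The fake buffer below is a hand-built stand-in, not a real archive.

```javascript
const EOCD_SIGNATURE = 0x06054b50;
const EOCD_MIN_SIZE = 22; // fixed part of the record, before the optional comment

// Scan backwards from the end of the buffer for the EOCD signature.
function findEocd(buf) {
  for (let i = buf.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIGNATURE) {
      return {
        offset: i,
        totalEntries: buf.readUInt16LE(i + 10),           // total central directory records
        centralDirectoryOffset: buf.readUInt32LE(i + 16), // where the central directory starts
      };
    }
  }
  return null; // not a zip file (or the comment is longer than our buffer)
}

// Minimal fake tail: 10 bytes of "file data", then an EOCD claiming 3 entries.
const eocd = Buffer.alloc(EOCD_MIN_SIZE);
eocd.writeUInt32LE(EOCD_SIGNATURE, 0);
eocd.writeUInt16LE(3, 10);  // total entries
eocd.writeUInt32LE(10, 16); // central directory offset
const fakeZipTail = Buffer.concat([Buffer.alloc(10), eocd]);

console.log(findEocd(fakeZipTail)); // { offset: 10, totalEntries: 3, centralDirectoryOffset: 10 }
```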

Etc..

Also, when I used zip, I followed the "shape" of the tar or tar-stream (forgot which one) interfaces for the extraction and archiving streams. This was useful because it makes it possible to replace tar streams with zip and vice versa. So we should check which "stream" interface we want to use to increase compatibility (e.g. with zlib streams).

H4ad commented 5 months ago

Here are some results:

Zip Size: 152,9Mb
Code: https://github.com/kibertoad/unzipomatic/tree/chore/benchmark/benchmark

yauzl:

Files Read: 20001
node bench-yauzl.mjs   0,24s  user 0,22s system 186% cpu 0,249 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                49 MB
page faults from disk:     5
other page faults:         22358

yarnzip:

Files Read: 20001
node bench-yarnzip.mjs   0,24s  user 0,18s system 135% cpu 0,307 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                358 MB
page faults from disk:     1
other page faults:         102460

In both libraries you need to read the file content manually. With yarnzip it is very easy: you just call readFileSync. With yauzl it is painful, since you need to open a stream and then read the whole stream.

H4ad commented 5 months ago

The first obvious thing we should improve is the API of yauzl, to use generators or callbacks/streams.

Then, we should improve the usability of reading file content. We can keep openReadStream, but we should try to add simpler/easier ways to read the content (returning a string or buffer) for those who just need the file content.
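A hypothetical shape for that improved API: wrap a yauzl-style callback method in a promise, and expose entries via an async generator so callers can `for await` over them. `fakeZipfile` and `stat` below are stand-ins invented for illustration, not yauzl's real objects.

```javascript
// Stand-in for a yauzl-style ZipFile with a callback-based method.
const fakeZipfile = {
  entries: [{ fileName: 'a.txt' }, { fileName: 'b.txt' }],
  stat(entry, cb) {
    process.nextTick(() => cb(null, { fileName: entry.fileName, size: 42 }));
  },
};

// Promise wrapper around the callback method.
function statEntry(zipfile, entry) {
  return new Promise((resolve, reject) => {
    zipfile.stat(entry, (err, info) => (err ? reject(err) : resolve(info)));
  });
}

// Async generator so callers can iterate entries with for-await.
async function* iterateEntries(zipfile) {
  for (const entry of zipfile.entries) {
    yield await statEntry(zipfile, entry);
  }
}

async function main() {
  for await (const info of iterateEntries(fakeZipfile)) {
    console.log(info.fileName, info.size); // a.txt 42, then b.txt 42
  }
}
main();
```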

Uzlopak commented 5 months ago

So you prefer yauzl?

H4ad commented 5 months ago

@Uzlopak yeah, the library is very simple (one file, 800 lines), no wasm (not that wasm is bad, but I have no experience using it), very low memory usage, and as fast as yarnzip.

I think we just need to improve the developer UX; the library itself is already very optimized.

kibertoad commented 5 months ago

One is faster, the other is more memory-efficient. We can improve the API of either.

kibertoad commented 5 months ago

@H4ad is "user" the only important metric?

H4ad commented 5 months ago

@kibertoad What do you mean? I think yauzl can achieve the goals you mentioned and also won't be a huge challenge to port/refactor (yarnlib is very big in size/lines of code).

kibertoad commented 5 months ago

@H4ad in the benchmarks you shared, total execution time is different, but user execution time is the same. And yeah, native JS will definitely be easier to maintain and consume.

Uzlopak commented 5 months ago

I agree that yauzl seems easier to grok. Maybe also easier to refactor and optimize.

H4ad commented 5 months ago

The user time is probably the time spent in JS, and the system time is probably the time spent on syscalls; yarn will be higher on the latter because it reads the whole file into memory.

I say "probably" because I never really thought about those metrics, I always look at total time.

But the important metric is total time, which is ~0,25s for yauzl and ~0,3s for yarnzip.

yauzl is faster, uses less memory, and is also easier to maintain.

H4ad commented 5 months ago

About yauzl, someone already tried to improve that library: https://github.com/overlookmotel/yauzl-promise

H4ad commented 5 months ago

Should we create some PRs against yauzl-promise, or is it better to have our own version?

I think we want to add more APIs, not only port the current version, so I don't know if it's worth it.

Uzlopak commented 5 months ago

I kind of don't like the codebase of yauzl-promise. I feel yauzl may be too simple, but I think with our expertise we would implement it differently.

How about a benchmark @H4ad ?

H4ad commented 5 months ago

Benchmark for yauzl-promise:

Files Read: 20001
node bench-yauzl-promise.mjs   0,36s  user 0,34s system 193% cpu 0,363 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                84 MB
page faults from disk:     1
other page faults:         33186

The slowdown is probably caused by the usage of generators, so it's not a huge problem. The only thing I didn't understand is why it uses more memory.

kibertoad commented 5 months ago

@H4ad can you create a PR with the benchmarks put in a subfolder? Useful thing to have.

H4ad commented 5 months ago

Sure: https://github.com/kibertoad/unzipomatic/pull/3

H4ad commented 5 months ago

Well, based on our discussion, I think yauzl will be our choice.

If so, then we just need to create a PR with the initial code.

I will create more issues to address refactoring the code, refactoring the tests, and discussing the new API (since we will probably avoid callbacks).

kibertoad commented 5 months ago

Yup, yauzl as a primary starting point makes sense. We can use the other libraries as points of reference and borrow from them if there is something valuable that could be reused.

I will mention this in the readme.