Native zip support - Githubissues

arcanis commented 1 year ago

What is the problem this feature will solve?

This issue is to test the water a little bit.

Node supports zlib natively, which takes care of the compression / decompression aspect of sharing files, but it doesn’t include any primitive allowing to manipulate file archives of any kind. It can be implemented in userland (and in a project of mine we’re doing this by compiling the libzip to wasm), but it’s significantly slower (both in runtime, boot time, and size).

Being able to read file archives is especially useful when coupled with the ESM loaders, since it then becomes possible to implement loading modules straight from packaged vendors - a more performant alternative to bundled packages which doesn’t require to precompile the code and allows distributing it along with non-JS files.

What is the feature you are proposing to solve the problem?

I’d like to investigate whether it’d make sense to implement a basic archive manipulation library in Node.js. This isn’t unheard of, multiple interpreters having similar capabilities, sometimes even supporting multiple formats:

PHP - https://www.php.net/manual/en/zip.examples.php
Python - https://docs.python.org/3/library/zipfile.html
Python (tar) - https://docs.python.org/3/library/tarfile.html
Perl - https://perldoc.perl.org/IO::Compress::Zip
Perl (tar) - https://perldoc.perl.org/Archive::Tar
Go - https://pkg.go.dev/archive/zip@go1.19.3
Go (tar) - https://pkg.go.dev/archive/tar@go1.19.3
Java - https://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html
C# - https://learn.microsoft.com/en-us/dotnet/standard/io/how-to-compress-and-extract-files

The API of this basic archive library would be to define, but generally we can assume some simple requirements:

Read an archive’s directory listing
Stat files from within an archive
Read a buffer from an archive
Add a file to an archive

I’d be interested to explore an implementation myself.

What alternatives have you considered?

No response

@nodejs/zlib

bnoordhuis commented 1 year ago

You should probably also enumerate what zip file features you think are in or out of scope. To name a few: large archive support, encryption, bzip2/lzma compression, etc.

arcanis commented 1 year ago

I'd tend to see the following:

Filesystem access (👎 out of scope)

Node already has primitives allowing to access the filesystem, so the native archive implementation shouldn't open/write files directly. It should only accept and output raw buffers, leaving it up to the user to retrieve them (usually with readFile).

A post-mvp follow-up might be to make it work with fs.open file descriptors, if we see significant improvements in perfs by doing so (I can imagine it being useful when you only want to read a couple of files in a large zip archive).

Arbitrary compression (👍 in scope)

Generating data would require to provide the API with pre-compressed buffers (and algorithm), whereas reading data would yield the compressed data - that users would then uncompress themselves. This would also make it possible for users to generate stable archives (same input => same output), which is more difficult if the compression is left to Node.

Would it be worthwhile to have a fast path for deflate, which is the most common compression out there?

Stat / External attributes (👍 in scope)

The API should offer a way to set the files' stat, including external attributes (required for symlink support).

Encryption (🔮 post-mvp scope)

It should be technically fairly easy to do via libzip. Main consideration would be how basic we want the first iteration to be (if we want to start small with only the features that 95% of consumers would use, then encryption might not be needed).

Large Archives (🔮 post-mvp scope)

Zip is limited to 32 bits in a couple of places, but libzip supports Zip 64 which doesn't have those problems. Same thing as the previous point, I'd tend to see it as something that we could expose, but that I'd see as a second iteration (although unlike encryption it probably doesn't impact the public API much).

Splitting a single archive into multiple files (.zip.001, .zip.002, ...) is out of scope.

GeoffreyBooth commented 1 year ago

I just discovered this issue. Here’s what I’m currently doing; I’d love to be able to do this with zlib instead. My use case is generating a .zip file for upload to AWS lambda, and as this is part of a CI script it would be convenient to be able to do this without any third-party dependencies.

import { readFile } from 'node:fs/promises'
import jszip from 'jszip'

const code = await readFile('./index.js', 'utf8')
const zip = new jszip()
zip.file('index.js', code, { unixPermissions: 755 })
const zipBuffer = await zip.generateAsync({
  type: 'nodebuffer',
  compression: 'DEFLATE',
  platform: 'UNIX',
})

richajak commented 1 year ago

Another approach is by spawning a native shell tool from Linux. In this way, we do not have to install any third-party js dependencies.

first, install the zip tool (e.g. debian), $ sudo apt-get install zip

then, in the code, simply call execSync('zip test.zip test.txt')

https://packages.debian.org/bullseye/zip https://exploringjs.com/nodejs-shell-scripting/ch_nodejs-child-process.html#synchronous-helper-functions-based-on-spawnasync

MoLow commented 1 year ago

Another approach is by spawning a native shell tool from Linux. In this way, we do not have to install any third-party js dependencies.

that will only work for linux/unix , and only assuming the tool is installed - which is not the case for many ditros

arcanis commented 1 year ago

I started a prototype implementation over at https://github.com/nodejs/node/pull/45651

mscdex commented 1 year ago

-1 I know this is an unpopular opinion, but I'm not convinced this should live in node core. Sure it's a common file format, but then again so are tar, rar, 7z, zstd, xz, and others, which also don't belong in node core.

I feel like adding such modules to node core is further leading to feature creep. I get that other platforms like PHP and such may have zip modules, but they also include a ton of other modules that make them "kitchen sink" platforms, which I would hate to see node.js become.

arcanis commented 1 year ago

I understand the general concern, and I wouldn't want to increase the Node.js surface if the value wasn't well identified. But while some modules can happen in userland, I think archive support (whether it's zip or something else) is important:

Archive manipulation can be a security risk if done incorrectly. As far as I know, most JS implementations of zip / tar parsing have been vulnerable through their history to some sort of path traversal attack (some recently). At times they even made their ways into package managers - and some of these vulnerabilities were reported through the Node.js security program. Having a trustable common implementation in the core would be beneficial to the ecosystem's security.
It answers a fairly common need: how to store multiple files / data structures without encumbering the filesystem? This need becomes more pressing now with the ESM loaders on their way to graduation: loading modules from archives rather than the filesystem will be a common pattern part of the critical bootstrap path (Yarn can already generate such a loader when requested).
- Another area of exploration - early, but promising - are packaged Node.js applications. For example node-sea already allows packing a full package into a single self-executable Node.js script (using the WASM Zip library I mention below).
Parsing archives is one thing, but doing efficiently is another. While WASM is an option (it's what we currently use), it makes the startup slower, blocks the shutdown, the memory footprint much larger, increases the risks of memory leaks, and GC integration is subpar at best. As I mentioned, the need to read from archives will be a regular occurence in Node.js loaders, and primitives making this work easier and more stable would be welcomed.

To summarize, I think an archive manipulation library would have a reasonable amount of practical use cases, for a comparatively low cost. It may not be a feature commonly used by end-users, but I expect tooling authors to benefit from it in ways that will still benefit end-users.

Sure it's a common file format, but then again so are tar, rar, 7z, zstd, xz, and others, which also don't belong in node core.

I don't see it as a slippery slope situation - this feature request is about providing efficient support for a use case that cannot be implemented efficiently today. Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well - which isn't the case today.

As for why zip, I went with it for a couple of reasons:

It's a format integrated with most OS distributions, interpreters, and text editors.
It supports various compression algorithms (including none) on a per-file basis.
It offers random access capabilities, which isn't the case of tar, and is important for the loader / packaged app use case.

Other formats may have better compression, but since the goal of this work is first and foremost to address use cases, I went with the format that offered a good compromise between popularity and feature set.

tniessen commented 1 year ago

Archive manipulation can be a security risk if done incorrectly. (...) Having a trustable common implementation in the core would be beneficial to the ecosystem's security.

I'm pretty sure we've heard this argument for just about any proposed feature. In this issue alone, the same thing can be said about tar, xz, etc., which is the slippery slope that @mscdex mentioned.

Parsing archives is one thing, but doing efficiently is another. While WASM is an option (it's what we currently use), it makes the startup slower, the memory footprint much larger, increases the risks of memory leaks, and GC integration is subpar at best.

This can, again, be said about everything that's best implemented as a native C++ addon right now. Native addons are a burden, and WebAssembly can be a meaningful alternative, but we can't add everything to Node.js that would otherwise require a native addon.

Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well

I might be mistaken but I generally consider Linux-like operating systems the primary domain of Node.js, maybe with the exception of Electron apps. And isn't zip terrible at preserving unix file attributes, for example? I don't see how we could possibly reject requests for tar or other archive formats by pointing to zip.

arcanis commented 1 year ago

This can, again, be said about everything that's best implemented as a native C++ addon right now. Native addons are a burden, and WebAssembly can be a meaningful alternative, but we can't add everything to Node.js that would otherwise require a native addon.

I'm not aware of other Node.js C++ addons that would pretend to be used in the critical path of loading an entire Node.js application. Please consider the arguments I exposed as a whole - I'm sure we can find reasons why any single one wouldn't be a good enough reason if alone, but once taken all together I find they make a lot more sense.

And isn't zip terrible at preserving unix file attributes, for example?

Symlinks can be represented just fine, mtime as well, and chmod equally. It doesn't support uid / gid / dev, but I never found that a limitation - at least not one I saw when I used it in my tools, which are used in production across all systems. And since you mention Electron apps, ASAR archives don't contain any of these metadata, with the exception of the +x bit. That seems to satisfy their needs just fine.

To be clear, I'm not rejecting tar, or any other format. As I said, since the goal of this work is first and foremost to address use cases, I went with the format that offered a good compromise between popularity and feature set. Random access is a very important feature, as it's what supports the use cases I described.

GeoffreyBooth commented 1 year ago

For me this feels like a missing feature that we should add. We have utilities for all sorts of compression and decompression algorithms in zlib; I thought for quite a while that some combination of them would work for creating a .zip file, and I was surprised and frustrated when I discovered that it was impossible. You could argue that maybe we never should have added zlib, but we did, and therefore I think that we should support what’s one of the most popular archive formats in use.

mscdex commented 1 year ago

We have utilities for all sorts of compression and decompression algorithms in zlib

I believe the original reason for adding a zlib binding was for the purposes of compressing HTTP responses, not just for the sake of having common compression formats available. Now you might argue the relevancy of deflate for HTTP responses in modern day HTTP servers, but that was most likely why it was originally added (back in 2011).

If zlib had some kind of built-in support for zip, that would be one thing and would mean we wouldn't need to pull in an additional dependency and such, but that's not the case today.

mscdex commented 1 year ago

I should add that there is minizip in the zlib source code, but I'm not entirely sure of the quality/reliability of that code (since it lives in the contrib/ dir) and/or if its feature set/performance would suffice for everyone interested in zip support.

Additionally I'm not sure if the Chromium team would want to keep the minizip files up to date since I presume they're not actively using them? Otherwise we'd have to pull in minizip changes from the madler/zlib repo.

arcanis commented 1 year ago

I believe the original reason for adding a zlib binding was for the purposes of compressing HTTP responses, not just for the sake of having common compression formats available.

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages, not just for the sake of having common archive formats available 🙂

If zlib had some kind of built-in support for zip, that would be one thing and would mean we wouldn't need to pull in an additional dependency and such, but that's not the case today.

The zlib project only covers the deflate algorithm, not file archives or other compression algorithms[^1], so the minizip in their contrib folder has little to no maintenance; its very website suggests minizip-ng instead (or libzip, which is used in both Chrome and the PHP-Zip library).

[^1]: Which wasn't a problem in #18964, where Brotli support was added via a new dep on brotli, and whose folder received a single commit since then - it seems to support that file formats have relatively low maintenance cost.

mscdex commented 1 year ago

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages

HTTP was a core feature of node from the beginning, storing and transferring multiple files inside a portable package was not.

Which wasn't a problem in Feature Idea: Brotli support in core #18964, where Brotli support was added via a new dep on brotli, and whose folder received a single commit since then - it seems to support that file formats have relatively low maintenance cost

For the record, I wasn't keen on adding brotli either, for similar reasons as zip.

tniessen commented 1 year ago

I'm not aware of other Node.js C++ addons that would pretend to be used in the critical path of loading an entire Node.js application. Please consider the arguments I exposed as a whole - I'm sure we can find reasons why any single one wouldn't be a good enough reason if alone, but once taken all together I find they make a lot more sense.

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages, not just for the sake of having common archive formats available 🙂

If the experimental SEA implementation moves forward to the point where Node.js internally needs support for any specific archive format, be it ASAR, tar, or even zip, that is still not the same as adding a public API for a specific archive format. If there is a requirement for user code to access archive members when packaged as some SEA, that also does not necessarily require a general-purpose API for said archive format.

arcanis commented 1 year ago

SEA-related experiments are a use case, not the use case. This issue is about the general need, not an Node.js-internal one.

bnoordhuis commented 1 year ago

Now you might argue the relevancy of deflate for HTTP responses in modern day HTTP servers, but that was most likely why it was originally added (back in 2011).

Specifically, it was added so npm could do its thing.

(Hi, node's living memory here!)

tniessen commented 1 year ago

I just discovered this issue. Here’s what I’m currently doing; I’d love to be able to do this with zlib instead.

We have utilities for all sorts of compression and decompression algorithms in zlib; I thought for quite a while that some combination of them would work for creating a .zip file, and I was surprised and frustrated when I discovered that it was impossible. You could argue that maybe we never should have added zlib, but we did, and therefore I think that we should support what’s one of the most popular archive formats in use.

IMHO, the fact that archives (tar) or members of archives (zip) are commonly compressed does not mean that supporting compression is any justification for supporting archives.

GeoffreyBooth commented 1 year ago

IMHO, the fact that archives (tar) or members of archives (zip) are commonly compressed does not mean that supporting compression is any justification for supporting archives.

I think the average user probably doesn’t grasp the distinction between compressed data and a compressed archive. I was one such user who thought that zlib could do what I wanted; it even has an unzip function.

We don’t have clear criteria (that I’m aware of) for what belongs in core versus not, and lately we’ve been relaxing a bit from the project’s earlier “small core” devotion. It’s obviously always debatable, but this does have several aspects that usually lean in favor of inclusion: it’s something that benefits from native compilation, it’s related to existing APIs, and it covers a use case commonly handled in scripts (where importing third-party dependencies is cumbersome). Yes Node scripts could call their platforms’ zip utility but having this functionality included ensures it’s both available and will work cross-platform.

Ultimately it’s a subjective call whether it’s useful enough, and I think we usually settle decisions like those via votes since that’s a hard question to reach consensus on.

tniessen commented 1 year ago

I think the average user probably doesn’t grasp the distinction between compressed data and a compressed archive. I was one such user who thought that zlib could do what I wanted; it even has an unzip function.

I don't know if this really is a common misconception. I don't remember seeing many feature requests related to this.

(node:zlib should already be sufficient for managing every aspect of compression of tar archives, only the archive format itself must be implemented. zip, on the other hand, isn't a compressed archive but rather an archive of (potentially) compressed files, so it is likely more difficult to implement based on node:zlib.)

it covers a use case commonly handled in scripts (where importing third-party dependencies is cumbersome). Yes Node scripts could call their platforms’ zip utility but having this functionality included ensures it’s both available and will work cross-platform.

Scripts are an interesting use case. Again, I have no data to base this claim on, but I would assume that Node.js scripts much more often run on non-Windows platforms than on Windows. On non-Windows platforms, tar is almost universally available whereas zip often is not. (On Windows, unfortunately, the opposite is true.)

Consider, for example, the scripts that produce Node.js builds -- the Node.js 19.2.0 download directory contains 19 tar files (source archives, binaries for non-Windows platforms) and only 2 zip files (binaries for Windows).

This does not seem in line with @arcanis's earlier argument:

Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well - which isn't the case today.

richardlau commented 1 year ago

Scripts are an interesting use case. Again, I have no data to base this claim on, but I would assume that Node.js scripts much more often run on non-Windows platforms than on Windows. On non-Windows platforms, tar is almost universally available whereas zip often is not. (On Windows, unfortunately, the opposite is true.)

FWIW Windows 10 got tar some years ago:

Neustradamus commented 1 year ago

To follow

github-actions[bot] commented 1 year ago

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

arcanis commented 1 year ago

I still want to bring it to the finish line; last blocker so far is a memory leak reported by Asan which I found difficult to investigate. I'll make another attempt in the coming weeks.

tniessen commented 1 year ago

@arcanis I don't think the memory leak is the main issue. There simply is no consensus that this should be added.

arcanis commented 1 year ago

I think both pros and cons have been clearly stated, yes. As you say, consensus on yes/no hasn't been reached yet. My understanding is that in cases like this whether to merge it or not will be a TSC decision.

Given that the major concerns you and other have raised are around scope and maintainability, I'd prefer to avoid theoretical back and forth, and have a working prototype showing exactly what's the envisioned scope and maintenance burden At the very least, this might prove I'm committed to maintain it myself on the long term, and be enough to address some concerns on this side.

Worst case, I lose some of my time, but I believe this feature is worth giving a real try, and seeing the upvotes, I'm not alone 🙂

(In any case, my previous comment was mostly intended to avoid stale close, not to add new elements to the discussion)

github-actions[bot] commented 7 months ago

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

Elkfoot2 commented 6 months ago

I would very much like to see this feature.

github-actions[bot] commented 3 weeks ago

There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the https://github.com/nodejs/node/labels/never-stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment. For more information on how the project manages feature requests, please consult the feature request management document.

BurningEnlightenment commented 6 days ago

🪄 unstale: The basic feature set outlined in the OP would cover my needs and would allow me to drop a native addon dependency.

nodejs / node

Native zip support #45434

What is the problem this feature will solve?

What is the feature you are proposing to solve the problem?

What alternatives have you considered?

Filesystem access (👎 out of scope)

Arbitrary compression (👍 in scope)

Stat / External attributes (👍 in scope)

Encryption (🔮 post-mvp scope)

Large Archives (🔮 post-mvp scope)