thejoshwolfe / yazl

yet another zip library for node
MIT License
329 stars 44 forks source link

Support for archive + file comments #44

Closed mojodna closed 5 years ago

mojodna commented 5 years ago

This adds support (and tests) for archive and file comments. Archive comments are set by providing comment in .end()'s options. File comments are set by providing fileComment when adding files.

thejoshwolfe commented 5 years ago

Thanks for the PR. What's the usecase for adding comments?

This has been proposed and rejected before in #15. I'm open to adding support for this if there's a real usecase.

mojodna commented 5 years ago

Short version: application-specific metadata as part of the central directory.

I’m building out the JavaScript tooling for https://medium.com/@mojodna/tapalcatl-cloud-optimized-tile-archives-1db8d4577d92, particularly the archive generation piece. A preliminary survey of unzip libraries across multiple platforms showed support for comments and they fit the use-case really well, since a single read gets both the central directory and application-specific metadata.

The facilitating library is here: https://github.com/mojodna/tapalcatl-js

Thanks to RandomAccessReader, doing block-aligned remote reads was totally straightforward to implement. Thanks!

thejoshwolfe commented 5 years ago

Sounds like a compelling usecase to me!

And thanks for this high-quality PR! I'll work on merging this tonight.

mojodna commented 5 years ago

Awesome, thanks!

thejoshwolfe commented 5 years ago

So here's something disappointing about the zipfile spec and ecosystem: technically the zipfile comment is supposed to be encoded in CP437, but in practice I've only seen software that treats it as UTF-8. The overlap between UTF-8 and CP437 is the codepoints 0x20...0x7e (printable ASCII, no whitespace except single space ' '), which means if you restrict your file comment to those characters, everything will work. If you don't, things might work?

Do you need newlines (or hard tabs, or other ascii control characters, or non-ascii unicode characters) in your zipfile comment? For example, the python json.dumps(obj) will meet this restriction, but JavaScript's JSON.stringify(obj) will not (because it allows non-ascii unicode characters), and any pretty-printed json will also not meet this restriction, because of newlines.

Note that the file comments on entries in the zipfile can be UTF-8, so there's no problem there.

thejoshwolfe commented 5 years ago

oops. accidentally merged. .... this PR probably won't reopen will it :thinking:

thejoshwolfe commented 5 years ago

Merged in #45. Thanks again @mojodna!

thejoshwolfe commented 5 years ago

@mojodna I published yazl version 2.5.0 which includes this feature. Please let me know how it works for you.

The CP437 situation is pretty silly. I'm going to great lengths to adhere to an obscure part of the spec that no one else seems to care about. I'm seriously considering giving up on cp437 entirely both in yazl and yauzl. :thinking:

andrewrk commented 5 years ago

The CP437 situation is pretty silly. I'm going to great lengths to adhere to an obscure part of the spec that no one else seems to care about. I'm seriously considering giving up on cp437 entirely both in yazl and yauzl. :thinking:

I have a suggestion:

Create your own zip file specification. Give it a new catchy name something like "Frenzied Zip File Format Specification" and make the tweaks you want, such as clearing up ambiguity, declaring that cp437 is only used in certain circumstances, whatever. Then make it clear that most zip file readers are actually adhering to Frenzied Zip File Format Specification anyway, not the "official" specification. Make a blog post talking about the problems with the official specs, and promote your fork spec, and point out that most people are using your spec anyway, we should just all admit it and agree this is the new standard.

thejoshwolfe commented 5 years ago

The Mac Archive Utility can't read self extracting zip files. So not everyone is on the same page in the wild.

I guess I could just have a special section for Archive Utility zipfiles, since they're so different anyway. A blog post is something I've been meaning to do for years though.

mojodna commented 5 years ago

@thejoshwolfe awesome, thanks so much!

I did a bit of experimentation to see how various tools / libs handle UTF-8 encoded comments. The short version is that all the ones I've tested just pass the bytes through (especially Python 3, where it comes back as bytes) and don't worry about their encoding.

Archive creator:

var fs = require("fs");
var yazl = require("yazl");
var zipfile = new yazl.ZipFile();
zipfile.outputStream.pipe(fs.createWriteStream("test.zip"));
zipfile.addFile("package.json", "package.json", {
  fileComment: "abcd⎋⌘👋"
});
zipfile.end({ comment: Buffer.from("abcd⎋⌘👋") })

unzip:

$ unzip -v test.zip
Archive:  test.zip
abcd⎋⌘👋
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     694  Defl:N      333  52% 11-17-2018 17:26 1dd024d9  package.json
abcd⎋⌘👋
--------          -------  ---                            -------
     694              333  52%                            1 file

Python 2:

>>> import zipfile
>>> z = zipfile.ZipFile("test.zip")
>>> z.comment
'abcd\xe2\x8e\x8b\xe2\x8c\x98\xf0\x9f\x91\x8b'
>>> print z.comment
abcd⎋⌘👋

Python 3:

>>> import zipfile
>>> z = zipfile.ZipFile("test.zip")
>>> z.comment
b'abcd\xe2\x8e\x8b\xe2\x8c\x98\xf0\x9f\x91\x8b'
>>> print(z.comment)
b'abcd\xe2\x8e\x8b\xe2\x8c\x98\xf0\x9f\x91\x8b'
>>> print(z.comment.decode("utf-8"))
abcd⎋⌘👋

JDK 8 (via Scala 2.12.7):

scala> import java.util.zip.ZipFile
import java.util.zip.ZipFile

scala> val z = new ZipFile("test.zip")
z: java.util.zip.ZipFile = java.util.zip.ZipFile@1d23ff23

scala> z.getComment()
res1: String = abcd⎋⌘👋