Provide an option to ignore Info-ZIP Unicode Path Extra Field

sandy081 commented 3 months ago

VS Code uses yazl & yauzl packages for zipping and unzipping extensions respectively. We do not use any extra headers while creating the zip, therefore we do not want to respect these extra headers while unzipping.

But there is no option in yauzl package to skip these headers. It would be great if you can provide an option to skip.

thejoshwolfe commented 3 months ago

A VS Code developer. Hello! I use your software every day at work. Thanks for your open source contributions! I'm happy to hear yazl and yauzl are useful for you.

It sounds like you're encountering zip files that include the Info-ZIP Unicode Path Extra Field, but you would like to ignore the extension override and use the regular file name instead (and presumably interpret it as UTF-8?). I'm very interested in how this situation came to be. Do you know which software created these zip files? Was it a buggy version of the software? What's the reason you want to ignore the file name in the extension data?

In #33, a zipfile was provided (unknown where it came from) that presumably used CP936 for the regular file name, and the Info-ZIP Unicode Path Extra Field was required to figure out the correct interpretation of the names. In the example zip file provided in that issue, General Purpose Bit 11 Language encoding flag (EFS) was not set, which according to the spec means that the filename is encoded in CP437, which is definitely wrong. The extension was required to interpret the file name correctly in that case. See also #84, which is also related to file name encoding.

If we can figure out exactly why we need a skip option, I would like to document that as justification, like how the README today about backslash normalization references a bug in ".NET versions 4.5.0 until 4.6.1".

sandy081 commented 3 months ago

@thejoshwolfe Glad that you use VS Code 🎉

Our extension packaging software, VSCE, uses yazl library to package the extension. I believe this library does not support Info-ZIP Unicode Path Extra Field header and therefore we want to be consistent and not support this header while unzipping.

thejoshwolfe commented 3 months ago

Oh interesting. I'm not sure I understand the intent. The Info-ZIP Unicode Path Extra Field is an official extension listed in the spec: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT . So if the intent is to support reading standard-compliant zipfiles, I recommend allowing the extra field. If the intent is to only support zipfiles created specifically by yazl, that would be a challenging usecase to support. Do you want to reject otherwise compliant zip files in some cases?

Or maybe a different question would be: have you found a zipfile where recognizing the extra field causes a significant difference? and do you know what software created that zip file?

If the desire is just for consistency, I recommend instead focusing on standard-compliant behavior. I think that will give better results, but I'm curious if you've found a counterexample.

sandy081 commented 3 months ago

Our intent is more about being precautious, rather than being consistent. While vsce does use yazl to package up extensions, where the ZIP feature set is kept under control, an extension can be crafted using other tools and thus use any ZIP format feature, including the Info-ZIP Unicode Path Extra Field. We'd like to limit the use of such features and the best idea we've come up with was to do it at extraction time. We understand this would be an option which most of your customers would simply ignore.

We're willing to contribute with a PR, let us know if you'd be up for it.

thejoshwolfe commented 2 months ago

My expert opinion is that you shouldn't worry about disabling the extra field. If a zip file author puts the field in there, you should probably respect it, which is the default behavior of yauzl. I'm always open to data points to the contrary, and if you've come up with this requirement based on observing unusual zip files, please let me know.

There's an example zip file in #33 that requires reading the extra field in order for the file names to be interpreted correctly. Why do you want to interpret zipfiles created by that same software incorrectly? It doesn't seem precautious; it seems incorrect to ignore the field.

There's a time and a place for reading zipfiles incorrectly. yauzl currently reads many different kinds of buggy zip files created by the Apple Archive Utility incorrectly, and I think it's the right decision, but that's based on several nuanced factors and observations. (See #69 if you're curious.) Do you have an argument for interpreting the Info-ZIP Unicode Path Extra Field incorrectly?

thejoshwolfe / yauzl

Provide an option to ignore Info-ZIP Unicode Path Extra Field #159