gildas-lormeau / zip.js

JavaScript library to zip and unzip files supporting multi-core compression, compression streams, zip64, split files and encryption.
https://gildas-lormeau.github.io/zip.js
BSD 3-Clause "New" or "Revised" License

Archived filenames are not read properly. #352

Closed: agr closed this issue 2 years ago

agr commented 2 years ago

Using zip.js from npm:

    "@zip.js/zip.js": "^2.6.12"

Very simple code:

    const url = 'https://globalcdn.nuget.org/packages/newtonsoft.json.13.0.1.nupkg';
    const httpReader = new zip.HttpRangeReader(url);
    httpReader.reader.forceRangeRequests = true;   // Accept-Ranges is not exposed through CORS request, so need some overrides
    const zipReader = new zip.ZipReader(httpReader);
    const entries = await zipReader.getEntries();

yields broken filenames in the entries array:

image

However, if I manually decode rawFilename, I get proper value:

image
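
The manual decoding mentioned above might look roughly like the following sketch. It assumes each entry exposes its undecoded name as a `Uint8Array` in `rawFilename` (the property named above) and that the names are plain ASCII, in which case decoding as UTF-8 gives the same result as CP437. The helper name `decodeRawFilename` and the sample entry are hypothetical.

```javascript
// Hypothetical helper: decode an entry's raw filename bytes by hand.
// Assumes ASCII-only names, for which UTF-8 and CP437 agree.
function decodeRawFilename(entry) {
  return new TextDecoder("utf-8").decode(entry.rawFilename);
}

// Made-up stand-in for an entry returned by zipReader.getEntries().
const sampleEntry = {
  rawFilename: new TextEncoder().encode("folder/example.txt"),
};

console.log(decodeRawFilename(sampleEntry));
```

This only works around the symptom; names containing non-ASCII CP437 characters would still need the library's own table.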

How do I fix it?

gildas-lormeau commented 2 years ago

I tested your code in Firefox, Chrome, Microsoft Edge (on Win11, Ubuntu 20 and Android 12) and I cannot reproduce the issue, see https://jsfiddle.net/uvmyq3b9/. Do you run the code in a particular environment?

FYI, the filenames in the zip file of your code are encoded in CP437. Thus, the decoding of the data into UTF-8 is handled by the function exported in /lib/core/util/cp437-decode.js. It looks like CP437 is corrupted in your environment.

https://github.com/gildas-lormeau/zip.js/blob/876c084cebe48bcf6f37085481d2f839cbc2d4f9/lib/core/util/cp437-decode.js#L28-L30

agr commented 2 years ago

Windows 10, the code is loaded from nginx running on localhost. I use zip-full.js from the dist directory. Also tried all browsers, the behavior is the same everywhere. Will experiment more.

agr commented 2 years ago

Oh, maybe the file itself is not treated as UTF-8 when served by nginx...

gildas-lormeau commented 2 years ago

Can you reproduce the problem with the test on jsfiddle? The file should be served as binary data by nginx. If the encoding was invalid, zip.js wouldn't be able to find the entries in the file.

agr commented 2 years ago

No, jsfiddle works fine. This is the content type I get from unpkg.com:

image

This is the content type I get from my nginx:

image

That missing charset is likely the cause.

gildas-lormeau commented 2 years ago

Sorry, I thought you were referring to the zip file. It's strange though, that would mean the JS code is not parsed in UTF-8.

agr commented 2 years ago

Yes, nginx was not configured to specify a charset, and apparently browsers then default to interpreting the JS code with some code page other than UTF-8. As a result, the JS interpreter reads the value of CP437 incorrectly: the string contains characters that are multi-byte in UTF-8 right from the start, so each individual byte is treated as a separate character, which shifts the following entries and eventually messes up decoding. I enforced UTF-8 in nginx and that fixed the issue.
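
The effect described here can be reproduced in isolation. The sketch below takes the first four entries of the lookup table, encodes them as UTF-8, and then reads them back one character per byte. That single-byte reading is a simplified stand-in for whatever legacy code page the browser actually chose, so the exact garbage characters may differ, but the length mismatch is the same.

```javascript
// First four entries of the CP437 lookup table, written with escapes so
// this snippet itself does not depend on how the file is served.
const tableStart = "\u0000\u263a\u263b\u2665";

// The bytes a server would send for that string literal in a UTF-8 file.
const utf8Bytes = new TextEncoder().encode(tableStart);

// Simulate a single-byte interpretation: one character per byte.
// (Assumption: the real fallback encoding behaves similarly.)
const misread = Array.from(utf8Bytes, (byte) => String.fromCharCode(byte)).join("");

console.log(tableStart.length); // 4 entries, as intended
console.log(misread.length);    // 10 entries, so later indexes are shifted
console.log(tableStart[3] === "\u2665", misread[3] === "\u2665"); // true false
```

Because the table is indexed by byte value, every lookup after the first multi-byte character lands on the wrong entry.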

gildas-lormeau commented 2 years ago

I'm glad to hear it! Maybe I should add a note in the documentation about this requirement.

gildas-lormeau commented 2 years ago

I added a note; I think I can close the issue now. Thank you for pointing this out.

agr commented 2 years ago

Replacing that "...".split("") with something like this:

const CP437 = [String.fromCharCode(0),String.fromCharCode(9786),String.fromCharCode(9787),String.fromCharCode(9829),String.fromCharCode(9830),String.fromCharCode(9827),String.fromCharCode(9824),String.fromCharCode(8226),String.fromCharCode(9688),String.fromCharCode(9675),String.fromCharCode(9689),String.fromCharCode(9794),String.fromCharCode(9792),String.fromCharCode(9834),String.fromCharCode(9835),String.fromCharCode(9788),String.fromCharCode(9658),String.fromCharCode(9668),String.fromCharCode(8597),String.fromCharCode(8252),String.fromCharCode(182),String.fromCharCode(167),String.fromCharCode(9644),String.fromCharCode(8616),String.fromCharCode(8593),String.fromCharCode(8595),String.fromCharCode(8594),String.fromCharCode(8592),String.fromCharCode(8735),String.fromCharCode(8596),String.fromCharCode(9650),String.fromCharCode(9660),String.fromCharCode(32),String.fromCharCode(33),String.fromCharCode(34),String.fromCharCode(35),String.fromCharCode(36),String.fromCharCode(37),String.fromCharCode(38),String.fromCharCode(39),String.fromCharCode(40),String.fromCharCode(41),String.fromCharCode(42),String.fromCharCode(43),String.fromCharCode(44),String.fromCharCode(45),String.fromCharCode(46),String.fromCharCode(47),String.fromCharCode(48),String.fromCharCode(49),String.fromCharCode(50),String.fromCharCode(51),String.fromCharCode(52),String.fromCharCode(53),String.fromCharCode(54),String.fromCharCode(55),String.fromCharCode(56),String.fromCharCode(57),String.fromCharCode(58),String.fromCharCode(59),String.fromCharCode(60),String.fromCharCode(61),String.fromCharCode(62),String.fromCharCode(63),String.fromCharCode(64),String.fromCharCode(65),String.fromCharCode(66),String.fromCharCode(67),String.fromCharCode(68),String.fromCharCode(69),String.fromCharCode(70),String.fromCharCode(71),String.fromCharCode(72),String.fromCharCode(73),String.fromCharCode(74),String.fromCharCode(75),String.fromCharCode(76),String.fromCharCode(77),String.fromCharCode(78),String.fromCharCode(79),String
.fromCharCode(80),String.fromCharCode(81),String.fromCharCode(82),String.fromCharCode(83),String.fromCharCode(84),String.fromCharCode(85),String.fromCharCode(86),String.fromCharCode(87),String.fromCharCode(88),String.fromCharCode(89),String.fromCharCode(90),String.fromCharCode(91),String.fromCharCode(92),String.fromCharCode(93),String.fromCharCode(94),String.fromCharCode(95),String.fromCharCode(96),String.fromCharCode(97),String.fromCharCode(98),String.fromCharCode(99),String.fromCharCode(100),String.fromCharCode(101),String.fromCharCode(102),String.fromCharCode(103),String.fromCharCode(104),String.fromCharCode(105),String.fromCharCode(106),String.fromCharCode(107),String.fromCharCode(108),String.fromCharCode(109),String.fromCharCode(110),String.fromCharCode(111),String.fromCharCode(112),String.fromCharCode(113),String.fromCharCode(114),String.fromCharCode(115),String.fromCharCode(116),String.fromCharCode(117),String.fromCharCode(118),String.fromCharCode(119),String.fromCharCode(120),String.fromCharCode(121),String.fromCharCode(122),String.fromCharCode(123),String.fromCharCode(124),String.fromCharCode(125),String.fromCharCode(126),String.fromCharCode(8962),String.fromCharCode(199),String.fromCharCode(252),String.fromCharCode(233),String.fromCharCode(226),String.fromCharCode(228),String.fromCharCode(224),String.fromCharCode(229),String.fromCharCode(231),String.fromCharCode(234),String.fromCharCode(235),String.fromCharCode(232),String.fromCharCode(239),String.fromCharCode(238),String.fromCharCode(236),String.fromCharCode(196),String.fromCharCode(197),String.fromCharCode(201),String.fromCharCode(230),String.fromCharCode(198),String.fromCharCode(244),String.fromCharCode(246),String.fromCharCode(242),String.fromCharCode(251),String.fromCharCode(249),String.fromCharCode(255),String.fromCharCode(214),String.fromCharCode(220),String.fromCharCode(162),String.fromCharCode(163),String.fromCharCode(165),String.fromCharCode(8359),String.fromCharCode(402),String.fromCharCode(225)
,String.fromCharCode(237),String.fromCharCode(243),String.fromCharCode(250),String.fromCharCode(241),String.fromCharCode(209),String.fromCharCode(170),String.fromCharCode(186),String.fromCharCode(191),String.fromCharCode(8976),String.fromCharCode(172),String.fromCharCode(189),String.fromCharCode(188),String.fromCharCode(161),String.fromCharCode(171),String.fromCharCode(187),String.fromCharCode(9617),String.fromCharCode(9618),String.fromCharCode(9619),String.fromCharCode(9474),String.fromCharCode(9508),String.fromCharCode(9569),String.fromCharCode(9570),String.fromCharCode(9558),String.fromCharCode(9557),String.fromCharCode(9571),String.fromCharCode(9553),String.fromCharCode(9559),String.fromCharCode(9565),String.fromCharCode(9564),String.fromCharCode(9563),String.fromCharCode(9488),String.fromCharCode(9492),String.fromCharCode(9524),String.fromCharCode(9516),String.fromCharCode(9500),String.fromCharCode(9472),String.fromCharCode(9532),String.fromCharCode(9566),String.fromCharCode(9567),String.fromCharCode(9562),String.fromCharCode(9556),String.fromCharCode(9577),String.fromCharCode(9574),String.fromCharCode(9568),String.fromCharCode(9552),String.fromCharCode(9580),String.fromCharCode(9575),String.fromCharCode(9576),String.fromCharCode(9572),String.fromCharCode(9573),String.fromCharCode(9561),String.fromCharCode(9560),String.fromCharCode(9554),String.fromCharCode(9555),String.fromCharCode(9579),String.fromCharCode(9578),String.fromCharCode(9496),String.fromCharCode(9484),String.fromCharCode(9608),String.fromCharCode(9604),String.fromCharCode(9612),String.fromCharCode(9616),String.fromCharCode(9600),String.fromCharCode(945),String.fromCharCode(223),String.fromCharCode(915),String.fromCharCode(960),String.fromCharCode(931),String.fromCharCode(963),String.fromCharCode(181),String.fromCharCode(964),String.fromCharCode(934),String.fromCharCode(920),String.fromCharCode(937),String.fromCharCode(948),String.fromCharCode(8734),String.fromCharCode(966),String.fromCharCode(949)
,String.fromCharCode(8745),String.fromCharCode(8801),String.fromCharCode(177),String.fromCharCode(8805),String.fromCharCode(8804),String.fromCharCode(8992),String.fromCharCode(8993),String.fromCharCode(247),String.fromCharCode(8776),String.fromCharCode(176),String.fromCharCode(8729),String.fromCharCode(183),String.fromCharCode(8730),String.fromCharCode(8319),String.fromCharCode(178),String.fromCharCode(9632),String.fromCharCode(32)];

removes the dependency on the served code page at least in this case.

gildas-lormeau commented 2 years ago

I agree but it makes the built files significantly bigger.

agr commented 2 years ago

Well, I guess a string with \x codes should also work, give me a few moments to produce one.

gildas-lormeau commented 2 years ago

That would be useful only when using built files. I provide them but, in 2022, I expect most developers will import the library as an ES module. The HTML5 specification mandates the module files to be parsed in UTF-8 (see step 14 here: https://html.spec.whatwg.org/#fetch-a-single-module-script) so I'm tempted to say this issue is out of the scope of the library.

agr commented 2 years ago
const CP437 = "\u0000\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba\u25c4\u2195\u203c\u00b6\u00a7\u25ac\u21a8\u2191\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u2302\u00c7\u00fc\u00e9\u00e2\u00e4\u00e0\u00e5\u00e7\u00ea\u00eb\u00e8\u00ef\u00ee\u00ec\u00c4\u00c5\u00c9\u00e6\u00c6\u00f4\u00f6\u00f2\u00fb\u00f9\u00ff\u00d6\u00dc\u00a2\u00a3\u00a5\u20a7\u0192\u00e1\u00ed\u00f3\u00fa\u00f1\u00d1\u00aa\u00ba\u00bf\u2310\u00ac\u00bd\u00bc\u00a1\u00ab\u00bb\u2591\u2592\u2593\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b\u2510\u2514\u2534\u252c\u251c\u2500\u253c\u255e\u255f\u255a\u2554\u2569\u2566\u2560\u2550\u256c\u2567\u2568\u2564\u2565\u2559\u2558\u2552\u2553\u256b\u256a\u2518\u250c\u2588\u2584\u258c\u2590\u2580\u03b1\u00df\u0393\u03c0\u03a3\u03c3\u00b5\u03c4\u03a6\u0398\u03a9\u03b4\u221e\u03c6\u03b5\u2229\u2261\u00b1\u2265\u2264\u2320\u2321\u00f7\u2248\u00b0\u2219\u00b7\u221a\u207f\u00b2\u25a0 ".split("");

That should work.
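
For reference, an escaped literal like the one above could be generated with a small helper along these lines. This is only a sketch; the function name `toAsciiLiteral` is hypothetical and not part of zip.js.

```javascript
// Hypothetical helper: turn a string into a double-quoted JavaScript
// literal that contains only printable ASCII characters.
function toAsciiLiteral(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const code = text.charCodeAt(i);
    if (ch === "\\" || ch === '"') {
      out += "\\" + ch; // escape the two characters special inside "..."
    } else if (code >= 0x20 && code <= 0x7e) {
      out += ch; // printable ASCII stays as-is
    } else {
      out += "\\u" + code.toString(16).padStart(4, "0");
    }
  }
  return '"' + out + '"';
}

console.log(toAsciiLiteral("\u263aA")); // prints "\u263aA" (with the quotes)
```

Since the output contains no bytes above 0x7E, it reads the same under UTF-8 and under the common single-byte code pages.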

gildas-lormeau commented 2 years ago

It should indeed work, but when I build the project (npm run build), terser transforms the escaped string back into literal characters in the minified files, i.e. the built files are strictly identical.
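
One possible way to keep the escapes in the built files is terser's `format.ascii_only` option, which asks terser to escape non-ASCII characters in strings. Whether this fits the project's build setup is an assumption; the object below is only a sketch of the options that would be handed to terser (directly or through a bundler plugin), and the variable name is hypothetical.

```javascript
// Assumed minifier options, not the project's actual configuration.
// `format.ascii_only` makes terser emit \uXXXX escapes instead of
// literal non-ASCII characters.
const terserOptions = {
  format: {
    ascii_only: true,
  },
};

console.log(JSON.stringify(terserOptions));
```

The trade-off is the one already raised in this thread: escaped output is larger than the literal characters.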

nenge123 commented 2 years ago

If you are worried about the file encoding, you could preferably use:

    new Function(new TextDecoder().decode(new Uint8Array(await (await fetch("zip.js")).arrayBuffer())))();

or

    script.src = URL.createObjectURL(new File([new TextDecoder().decode(new Uint8Array(await (await fetch("zip.js")).arrayBuffer()))], "zip.js", { type: "text/javascript" }));

gildas-lormeau commented 2 years ago

Thank you @nenge123 for the suggestion.

I might be able to detect in zip.js when this error occurs and use TextDecoder as a fallback. I'll do some tests.
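
A minimal sketch of what such a detection could look like, assuming the corruption shows up as a lookup table that no longer has exactly 256 entries. This is not the code shipped in zip.js; the names `EXPECTED_TABLE_SIZE` and `decodeWithTable` are hypothetical.

```javascript
const EXPECTED_TABLE_SIZE = 256;

// Hypothetical decoder: use the lookup table when it looks intact,
// otherwise fall back to TextDecoder (UTF-8), which is still correct
// for ASCII-only filenames.
function decodeWithTable(bytes, table) {
  if (table.length !== EXPECTED_TABLE_SIZE) {
    // The table was probably mangled by a wrong script charset.
    return new TextDecoder().decode(bytes);
  }
  let result = "";
  for (const byte of bytes) {
    result += table[byte];
  }
  return result;
}
```

A length check is cheap and catches the case from this issue, where multi-byte characters were split into several entries; it would not catch a corruption that happens to preserve the length.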

gildas-lormeau commented 2 years ago

It's fixed in version 2.6.13, which I just published.