gildas-lormeau / zip.js

JavaScript library to zip and unzip files supporting multi-core compression, compression streams, zip64, split files and encryption.
https://gildas-lormeau.github.io/zip.js
BSD 3-Clause "New" or "Revised" License

Replace CP437 characters with Unicode equivalents #379

andrewgoz closed this issue 1 year ago

andrewgoz commented 1 year ago

I was writing a Python script to process a web page that has your script embedded in it. The Python text decoder was choking on a string that I'm pretty sure I tracked down to:

https://github.com/gildas-lormeau/zip.js/blob/eac1270b2e3b7b84eec13fd2859ca341be6f4df0/lib/core/util/cp437-decode.js#L31

No matter what Python encoding I tried, it kept on throwing an exception.

To get my script to work I replaced that line with:

const CP437 = "\0\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba\u25c4\u2195\u203c\u00b6\u00a7\u25ac\u21a8\u2191\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u2302\u00c7\u00fc\u00e9\u00e2\u00e4\u00e0\u00e5\u00e7\u00ea\u00eb\u00e8\u00ef\u00ee\u00ec\u00c4\u00c5\u00c9\u00e6\u00c6\u00f4\u00f6\u00f2\u00fb\u00f9\u00ff\u00d6\u00dc\u00a2\u00a3\u00a5\u20a7\u0192\u00e1\u00ed\u00f3\u00fa\u00f1\u00d1\u00aa\u00ba\u00bf\u2310\u00ac\u00bd\u00bc\u00a1\u00ab\u00bb\u2591\u2592\u2593\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b\u2510\u2514\u2534\u252c\u251c\u2500\u253c\u255e\u255f\u255a\u2554\u2569\u2566\u2560\u2550\u256c\u2567\u2568\u2564\u2565\u2559\u2558\u2552\u2553\u256b\u256a\u2518\u250c\u2588\u2584\u258c\u2590\u2580\u03b1\u00df\u0393\u03c0\u03a3\u03c3\u00b5\u03c4\u03a6\u0398\u03a9\u03b4\u221e\u03c6\u03b5\u2229\u2261\u00b1\u2265\u2264\u2320\u2321\u00f7\u2248\u00b0\u2219\u00b7\u221a\u207f\u00b2\u25a0\u00a0";
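For illustration, an escaped literal like the one above could be produced mechanically rather than by hand. The following is a minimal sketch, not part of zip.js: the helper name `escape_non_ascii` is hypothetical, and it only rewrites non-ASCII characters, so quotes, backslashes and control characters in the input still need separate handling.

```python
def escape_non_ascii(text: str) -> str:
    """Return text with every non-ASCII character written as a JavaScript \\uXXXX escape."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            # Plain ASCII is copied through unchanged.
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append(f"\\u{code:04x}")
        else:
            # Characters outside the BMP need a UTF-16 surrogate pair in JavaScript.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            parts.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(parts)


sample = "\u263a\u263b ABC \u00c7"
print(escape_non_ascii(sample))  # \u263a\u263b ABC \u00c7
```

The output is pure ASCII, so it can be read back with any ASCII-compatible encoding.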

I compared the two strings in a JavaScript console and they match. I did notice, though, that even when I made a mistake while building my alternate string, the web page's unzip functionality was not affected. I assume the CP437 variable is only used in limited circumstances.

Would you consider using this alternate Unicode representation?

gildas-lormeau commented 1 year ago

The problem is that you are not reading the web page as UTF-8. This issue is similar: https://github.com/gildas-lormeau/zip.js/issues/352. JavaScript files should be parsed as UTF-8. CP437 is an obsolete encoding from MS-DOS that is used for entry filenames in zip files. Zip files produced nowadays normally do not use it.
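As background on when a CP437 table comes into play: the ZIP format marks UTF-8 filenames with bit 11 of an entry's general purpose bit flag, and names without that bit are traditionally treated as CP437. The sketch below illustrates that rule in Python. It is not zip.js code, and `decode_entry_name` is a hypothetical helper.

```python
UTF8_FLAG = 0x0800  # bit 11 of the general purpose bit flag: name is UTF-8


def decode_entry_name(raw_name: bytes, flags: int) -> str:
    """Decode a zip entry filename according to the language encoding flag."""
    if flags & UTF8_FLAG:
        return raw_name.decode("utf-8")
    # Legacy entries: fall back to the MS-DOS code page.
    return raw_name.decode("cp437")


# Byte 0x82 is "é" in CP437.
print(decode_entry_name(b"caf\x82.txt", 0))  # café.txt
print(decode_entry_name("caf\u00e9.txt".encode("utf-8"), UTF8_FLAG))  # café.txt
```

Note that Python's `cp437` codec maps bytes 0x01 to 0x1F to control characters, whereas the constant discussed in this issue maps them to graphic symbols such as U+263A, so the two tables are not interchangeable for those bytes.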

andrewgoz commented 1 year ago

I was reading the file using:

with open('filename.html', 'r', encoding='utf_8') as f:
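One way to narrow down a failure like this is to read the file as raw bytes and inspect where decoding stops. The following is a minimal diagnostic sketch; `find_decode_error` is a hypothetical helper, and it relies only on the `start`, `end` and `reason` attributes that Python sets on `UnicodeDecodeError`.

```python
from typing import Optional


def find_decode_error(data: bytes, encoding: str = "utf-8", context: int = 20) -> Optional[dict]:
    """Return details about the first decoding failure, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return {
            "offset": err.start,    # index of the first offending byte
            "reason": err.reason,   # message supplied by the codec
            "context": data[start:err.end + context],  # surrounding raw bytes
        }
    return None


# Valid UTF-8 followed by a stray 0xFF byte, which can never occur in UTF-8.
broken = "ok \u263a".encode("utf-8") + b"\xff tail"
print(find_decode_error(broken))
```

With a real file, the bytes would come from `open('filename.html', 'rb').read()`. The reported offset and context show which bytes are not valid in the chosen encoding.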

In any case, I've realised that the minifier I'm using (and probably you are too) will convert my Unicode escapes back to their characters - meaning that even if you accepted my proposed change, the minified code would end up the same as it is now anyway!

I will need to figure this out myself. Sorry to bother you.

gildas-lormeau commented 1 year ago

@andrewgoz That's weird, I will try to do some tests to understand why Python can't read the file. Out of curiosity, are you actually processing pages saved with SingleFileZ?

andrewgoz commented 1 year ago

Just to be clear - I have no problems at all with the use of your library - it's working great!

The problem is that I copy-pasted the contents of zip-no-worker-inflate.min.js into my web page. Then the Python script I wrote to extract a temporary copy of the embedded JavaScript for jsdoc choked trying to read that web page file.
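The extraction step described here could look roughly like the sketch below. It is an assumption about the general approach, not the actual script from this thread: `ScriptExtractor` is a hypothetical class built on the standard library's `html.parser`, and it collects only inline scripts (those without a `src` attribute).

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect the text of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Skip external scripts: they carry no inline code.
        if tag == "script" and "src" not in dict(attrs):
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._buffer))
            self._in_script = False


page = '<html><script>const A = "\u263a";</script><script src="x.js"></script></html>'
extractor = ScriptExtractor()
extractor.feed(page)
extractor.close()
print(extractor.scripts)
```

The page text passed to `feed` has to be decoded first, which is where an encoding problem in the file would surface.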

I am not using SingleFileZ (or SingleFile).

gildas-lormeau commented 1 year ago

Thank you for the response and the kind words. I think I understand the issue. You're doing the same thing as SingleFileZ actually (the difference is that it embeds zip.js to extract the page as a zip file). I am still intrigued by the fact that this constant is problematic for Python though.