Closed bitsgalore closed 8 years ago
So the changes in https://github.com/openpreserve/jpylyzer/commit/b238eac2d42c09a84b7b5e81749416ee11a0604f fixed the original issue, but they created a new one. As a test I ran the modified code on a set of files whose names are made up of random Unicode characters. The dataset can be found here:
https://github.com/mo/randomgit
I ran all files through the following command (using Py 2.7):
python ~/jpylyzer/jpylyzer/jpylyzer.py --wrapper * >../all.xml
Result:
File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 685, in <module>
main()
File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 681, in main
checkFiles(args.inputRecursiveFlag, args.inputWrapperFlag, jp2In)
File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 610, in checkFiles
writeElement(xmlElement, out)
File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 553, in writeElement
xmlPretty = minidom.parseString(xmlOut).toprettyxml(' ')
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1928, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 116
Since the error happens in the part of the code that does the pretty-printing, I re-ran jpylyzer with the --nopretty switch. In that case no errors are reported, but the output still contains illegal characters. So the actual problem seems to occur earlier on.
The error is caused by the file names: after replacing fileName and filePath with a fixed, user-defined string, no error occurs and the output is valid XML.
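The parser failure can be reproduced without jpylyzer. This is a minimal sketch, assuming Python 3. The helper `is_well_formed` and the bare `fileName` element are illustrative only and not taken from the jpylyzer code. It feeds minidom one name that is valid UTF-8 and one that is not:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_bytes):
    """Return True if minidom (expat) can parse the given XML bytes."""
    try:
        minidom.parseString(xml_bytes)
        return True
    except ExpatError:
        return False


header = b'<?xml version="1.0" encoding="UTF-8"?>'

# A name that is valid UTF-8, wrapped in a hypothetical fileName element.
good = header + b"<fileName>" + u"ランダム.jp2".encode("utf-8") + b"</fileName>"

# A name containing bytes that can never occur in UTF-8.
bad = header + b"<fileName>" + b"\xff\xfe.jp2" + b"</fileName>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second document is rejected by expat in the same way as in the traceback above, which fits the observation that the file names, not the pretty-printer, are the source of the problem.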
On closer inspection, the randomgit repo contains 3 files whose file names are not valid UTF-8, and these are also the files that cause the problem. The Caja file manager shows an "invalid encoding" message next to their names.
Some other applications (e.g. Bless hex editor) also cannot open them for this reason. So the solution would be to test for UTF-8 validity and then take action based on the outcome of that.
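A validity test along these lines could look as follows. This is only a sketch, assuming the file names are available as raw byte strings (for example from `os.listdir` called with a bytes path under Python 3). The function names `is_valid_utf8` and `to_text` are made up for illustration, and replacing bad bytes with U+FFFD is just one possible action:

```python
def is_valid_utf8(raw_name):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False


def to_text(raw_name):
    """Decode a raw file name, substituting U+FFFD for undecodable bytes."""
    if is_valid_utf8(raw_name):
        return raw_name.decode("utf-8")
    # One possible action for invalid names: keep the readable part and
    # mark every undecodable byte with the Unicode replacement character.
    return raw_name.decode("utf-8", "replace")


print(is_valid_utf8(b"image.jp2"))
print(is_valid_utf8(b"\xff.jp2"))
```

Other actions are equally possible, such as skipping the file or reporting the name in an escaped form. The sketch does not say which one jpylyzer should take.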
Already tried:
fileName = os.path.basename(file).decode("UTF-8", "strict")
(and the same for filePath), which results in a UnicodeEncodeError (probably because fileName and filePath are now Unicode strings from the moment they're created).
So apparently the problem is caused by surrogate pairs in the file names. Tricky to solve, but here's something:
Also:
https://github.com/PythonCharmers/python-future/issues/116
Or perhaps:
Based on this, here's a solution that works in Python 3.x (but not in Python 2.x):
https://gist.github.com/bitsgalore/f65bcfc7a470b9fe9b90#file-stripsp_3xonly-py
On the other hand, the solution below uses regex (based on this). This works for Py 2.7, but strangely the regex results in a SyntaxError in Py 3.x:
https://gist.github.com/bitsgalore/f65bcfc7a470b9fe9b90#file-stripsp_2xonly-py
So the solution should combine both approaches (depending on Python version).
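For illustration, stripping surrogate code points can be sketched as below. This is not the code from the gists or from the jpylyzer commits. It assumes Python 3.3 or later, where undecodable file name bytes show up as lone surrogates (via the `surrogateescape` error handler), and `strip_surrogates` is a hypothetical name:

```python
import re

# Matches any code point in the UTF-16 surrogate range (U+D800 to U+DFFF).
SURROGATES = re.compile(u"[\ud800-\udfff]")


def strip_surrogates(name):
    """Return name with all surrogate code points removed."""
    return SURROGATES.sub(u"", name)


# Simulate what Python 3 produces for a file name that is not valid UTF-8:
# the undecodable byte 0xFF becomes the lone surrogate U+DCFF.
raw = b"ab\xffc.jp2"
name = raw.decode("utf-8", "surrogateescape")

clean = strip_surrogates(name)
# The cleaned name can be encoded as UTF-8 again without an error.
print(clean.encode("utf-8"))
```

One caveat: on a narrow Python 2 build, characters outside the Basic Multilingual Plane are stored as surrogate pairs, so a pattern like this would also remove those valid characters. A version-dependent approach, as suggested above, avoids that.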
The following fix works for both Py 2.x and 3.x:
https://github.com/openpreserve/jpylyzer/commit/18aac4571859fcf9c8adf3a222a35805a46bc7a4
https://github.com/openpreserve/jpylyzer/commit/5582e3aed03a5ff8b46068a94082a59c2cc73bc7
Which should solve this issue.
Under Windows, two things go wrong if the name of an input file contains non-Western characters:
First, in a regular wildcard scan on an entire directory, files with non-Western characters in their names are ignored. The issue can be reproduced by running jpylyzer on the contents of https://github.com/openpreserve/jpylyzer-test-files:
In this case, output for ランダム日本語テキスト.jp2 is missing in whatever.xml
In addition, running a recursive scan like this:
This triggers the following WindowsError:
Both errors only happen under Windows with Python 2.7. Under Windows the behaviour with Python 3.3 is correct; under Linux (Mint) the problem doesn't occur at all.
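A possibly related detail: under Python 2 on Windows, `glob.glob` and `os.listdir` return byte strings when they are given byte-string arguments, and characters outside the active code page can get lost in that conversion. When given Unicode arguments they return Unicode names. Whether this is the cause of the behaviour described here is an assumption, not something established in this report. The sketch below only shows the pattern of passing text (Unicode) arguments throughout. `list_images` is a hypothetical helper, and the demo runs in a temporary directory:

```python
import glob
import os
import tempfile


def list_images(directory, pattern=u"*.jp2"):
    """Return sorted matches for pattern inside directory, as text strings.

    Under Python 2, directory should itself be a unicode object so that
    the joined pattern stays unicode.
    """
    return sorted(glob.glob(os.path.join(directory, pattern)))


# Demo: create one file with a Japanese name and one unrelated file.
demo_dir = tempfile.mkdtemp()
jp_name = u"ランダム日本語テキスト.jp2"
open(os.path.join(demo_dir, jp_name), "wb").close()
open(os.path.join(demo_dir, u"notes.txt"), "wb").close()

matches = list_images(demo_dir)
print(len(matches))
```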