Surrogate pairs in UTF-8 filenames cause exception

bitsgalore commented 8 years ago

Under Windows, two things go wrong if the name of an input file contains non-Western characters:

First, in a regular wildcard scan on an entire directory, files with non-Western characters in their names are ignored. The issue can be reproduced by running jpylyzer on the contents of https://github.com/openpreserve/jpylyzer-test-files:

jpylyzer * > whatever.xml

In this case, the output for ランダム日本語テキスト.jp2 is missing from whatever.xml.

Second, running a recursive scan like this:

jpylyzer --recurse . > ../testall.xm

triggers the following WindowsError:

  File "f:\johan\pythonCode\jpylyzer\jpylyzer\jpylyzer.py", line 327, in checkOneFile
    "fileSizeInBytes", str(os.path.getsize(file)))
  File "C:\Python27\lib\genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'e:\\test\\???????????.jp2'

Both errors only happen under Windows with Python 2.7. Under Windows the behaviour with Python 3.3 is correct; under Linux (Mint) the problem doesn't occur at all.
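For context, the `?` characters in the path above are what Python 2.7 on Windows produces when a directory is listed through a byte-string path: names that do not fit the ANSI code page are replaced. Passing a Unicode path to `os.listdir` makes it return Unicode names instead. A minimal sketch of that idea follows; `list_dir_as_text` is a hypothetical helper and not necessarily what jpylyzer does.

```python
import os
import sys


def list_dir_as_text(directory):
    """Return the entries of `directory` as text (Unicode) strings.

    Hypothetical helper. os.listdir returns names of the same type as
    its argument, so a byte-string path is decoded to text first.
    """
    if isinstance(directory, bytes):
        # On Python 2, a plain str is a byte string and ends up here.
        # getfilesystemencoding() may return None on Python 2, hence the fallback.
        encoding = sys.getfilesystemencoding() or "utf-8"
        directory = directory.decode(encoding)
    return sorted(os.listdir(directory))
```

The returned names can then be joined to the directory with `os.path.join` and passed to calls such as `os.path.getsize` without losing characters.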

bitsgalore commented 8 years ago

Now fixed: https://github.com/openpreserve/jpylyzer/commit/b238eac2d42c09a84b7b5e81749416ee11a0604f

bitsgalore commented 8 years ago

So the changes in https://github.com/openpreserve/jpylyzer/commit/b238eac2d42c09a84b7b5e81749416ee11a0604f fixed the original issue, but they created a new one. As a test I ran the modified code on a set of files whose names are made up of random Unicode characters. The dataset can be found here:

https://github.com/mo/randomgit

I ran all files through the following command (using Py 2.7):

python ~/jpylyzer/jpylyzer/jpylyzer.py  --wrapper * >../all.xml

Result:

  File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 685, in <module>
    main()
  File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 681, in main
    checkFiles(args.inputRecursiveFlag, args.inputWrapperFlag, jp2In)
  File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 610, in checkFiles
    writeElement(xmlElement, out)
  File "/home/johan/jpylyzer/jpylyzer/jpylyzer.py", line 553, in writeElement
    xmlPretty = minidom.parseString(xmlOut).toprettyxml('    ')
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 116

Since the error happens in the part of the code that does the pretty-printing, I re-ran jpylyzer with the --nopretty switch. In that case no errors are reported, but the output still contains illegal characters. So the actual problem seems to occur earlier on.

The error is caused by the file names: after replacing fileName and filePath with a fixed user-defined string, no error occurs and the output is valid XML.

On closer inspection, the randomgit repo contains 3 files whose file names are not valid UTF-8, and these are also the files that cause the problem. The Caja file manager shows an "invalid encoding" message next to their names:

[Screenshot: randomgit]

Some other applications (e.g. Bless hex editor) also cannot open them for this reason. So the solution would be to test for UTF-8 validity and then take action based on the outcome of that.
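A minimal sketch of such a check, assuming the file name is available as a raw byte string (as it is when a byte-string path is listed under Python 2.7 on Linux). `is_valid_utf8` and `safe_name` are hypothetical helpers, and replacing bad bytes with U+FFFD is only one possible action.

```python
def is_valid_utf8(name_bytes):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name_bytes.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


def safe_name(name_bytes):
    """Decode a file name, substituting U+FFFD for undecodable bytes.

    Valid names are returned unchanged (as text); invalid ones still
    yield a string that can be written to XML.
    """
    if is_valid_utf8(name_bytes):
        return name_bytes.decode("utf-8")
    return name_bytes.decode("utf-8", "replace")
```

The substituted name is fine for reporting, but it no longer identifies the file on disk, so the original bytes should still be used to open the file.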

Already tried:

Reinstating fileName = os.path.basename(file).decode("UTF-8", "strict") (and the same for filePath) --> results in a UnicodeEncodeError (probably because fileName and filePath are now Unicode strings from the moment they're created).
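That explanation fits Python 2 behaviour: calling `.decode()` on a value that is already a Unicode string first encodes it implicitly as ASCII, which fails for non-ASCII names. A minimal sketch of a guard that decodes only byte strings; `to_text` is a hypothetical helper, not part of jpylyzer.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        # Python 2 str and Python 3 bytes both end up here.
        return value.decode(encoding, "strict")
    # Already text (unicode on Python 2, str on Python 3): leave as-is.
    return value
```

Note that the strict decode still raises UnicodeDecodeError for byte strings that are not valid UTF-8, so this only removes the UnicodeEncodeError, not the underlying problem.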

bitsgalore commented 8 years ago

So apparently the problem is caused by surrogate pairs in the file names. Tricky to solve, but here's something:

http://stackoverflow.com/questions/18673213/detect-remove-unpaired-surrogate-character-in-python-2-gtk

Also:

https://github.com/PythonCharmers/python-future/issues/116

Or perhaps:

http://stackoverflow.com/questions/3220031/how-to-filter-or-replace-unicode-characters-that-would-take-more-than-3-bytes

bitsgalore commented 8 years ago

Based on this, here's a solution that works in Python 3.x (but not in Python 2.x):

https://gist.github.com/bitsgalore/f65bcfc7a470b9fe9b90#file-stripsp_3xonly-py

On the other hand, the solution below uses regex (based on this). This works for Py 2.7, but strangely the regex results in a SyntaxError in Py 3.x:

https://gist.github.com/bitsgalore/f65bcfc7a470b9fe9b90#file-stripsp_2xonly-py
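For reference, a minimal sketch of the regex approach written so that it runs under Python 3.3 and later. It is not the code from either gist; `strip_surrogates` and its `replacement` parameter are hypothetical. It assumes every code point is stored as a single character, so on a narrow Python 2 build it would also damage valid characters outside the BMP, which are stored there as surrogate pairs.

```python
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
# The u"" prefix is accepted by Python 2 and by Python 3.3+.
_SURROGATES = re.compile(u"[\ud800-\udfff]")


def strip_surrogates(text, replacement=u"\ufffd"):
    """Replace surrogate code points in `text` with `replacement`."""
    return _SURROGATES.sub(replacement, text)
```

Under Python 3, file names that are not valid in the file system encoding are returned with the offending bytes mapped to lone surrogates (U+DC80..U+DCFF), which is the kind of character this removes. As for the SyntaxError, one possible cause, not verified against the gist, is a `ur"..."` literal: that prefix does not exist in Python 3.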