Extracted XML incomplete using Python 2.7

openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.

http://jpylyzer.openpreservation.org/

Extracted XML incomplete using Python 2.7 #44

Closed by bitsgalore 10 years ago

bitsgalore commented 10 years ago

I ran jpylyzer with the --nullxml option on this file:

http://sdowww.lmsal.com/sdomedia/hv_jp2kwrite/v0.8/jp2/AIA/2014/02/01/304/2014_02_01__00_11_07_13__SDO_AIA_AIA_304.jp2

With Python 2.7, some of the resulting XML elements are empty, even though in reality they do contain text (e.g. see the bottom of the output). With Python 3.3 it works correctly, so something goes wrong while parsing the XML. This could be a bug in ElementTree.

This also affects the Windows executables / Debian packages, since they are built using Python 2.7.

bitsgalore commented 10 years ago

Had another look at this: the actual parsing of the XML isn't the problem. Instead, the post-processing in etpatch.py/makeHumanReadable goes wrong in the conversion step at the bottom.

What might work:

In byteconv.py/bytesToText, set the encoding to "utf-8", remove the check for control characters, and clean up the decoded text using the function from http://stackoverflow.com/a/19016117/1209004:

import unicodedata

def remove_control_characters(s):
    # Drop every character whose Unicode general category starts with "C"
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
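A rough sketch of how the two steps might fit together. This is not jpylyzer's actual code: the name bytes_to_text is a hypothetical stand-in for byteconv.bytesToText, and errors="replace" is an assumption about how undecodable bytes should be handled.

```python
import unicodedata


def remove_control_characters(s):
    # Keep only characters whose Unicode general category does not start
    # with "C" (this also drops tabs and newlines, which are category Cc).
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


def bytes_to_text(data):
    # Hypothetical stand-in for byteconv.bytesToText.
    # Assumption: undecodable bytes are replaced rather than raising an error.
    text = data.decode("utf-8", errors="replace")
    return remove_control_characters(text)
```

Note that the category test is broad: it removes line breaks and tabs as well as NUL bytes, so multi-line text fields would be flattened to a single line.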

bitsgalore commented 10 years ago

Fixed in 1.11.1!