openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.
http://openpreservation.org/technology/products/fido/

`prepare` raises an exception when serializing the new document in PRONOM 89 (Python 2) #104

Closed by mistydemeo 7 years ago

mistydemeo commented 7 years ago

When trying to update to PRONOM 89, I encountered an exception raised by the ElementTree serializer while writing Fido's reformatted XML documents. In the original context this happens when we serialize a single large XML document. To isolate the bug, I adapted the script to write out each converted PRONOM record individually, and confirmed that:

  1. This happens when we write individual converted records (but not all of them), and
  2. It does not happen if we write out the original unaltered documents.
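
The per-record check described above could be sketched roughly as follows. This is a minimal illustration, not the actual test script: `find_unserializable` and the sample records are hypothetical, and on Python 3 a bytestring in `.text` surfaces as a `TypeError` rather than the `UnicodeDecodeError` seen on Python 2, so both are caught.

```python
import xml.etree.ElementTree as ET


def find_unserializable(records):
    """Return (index, exception) pairs for records that fail to serialize."""
    failures = []
    for index, record in enumerate(records):
        try:
            ET.tostring(record)
        except (TypeError, UnicodeError) as exc:
            # Python 2 raises UnicodeDecodeError here, Python 3 raises TypeError.
            failures.append((index, exc))
    return failures


# Hypothetical stand-ins for converted PRONOM records.
good = ET.Element("format")
ET.SubElement(good, "note").text = u"plain text note"

bad = ET.Element("format")
# Encoded bytes instead of Unicode text: the kind of value that breaks serialization.
ET.SubElement(bad, "note").text = u"curly \u2019 quote".encode("utf-8")

failures = find_unserializable([good, bad])
```

Serializing records one at a time narrows the failure down to specific records instead of one opaque error for the whole document.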

Sample traceback:

Traceback (most recent call last):
  File "test.py", line 473, in <module>
    print(ET.tostring(doc), file=devnull)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 380: ordinal not in range(128)

mistydemeo commented 7 years ago

This is possibly related to #83, but doesn't go through that codepath.

mistydemeo commented 7 years ago

I converted this to use lxml for debugging, and it turns out lxml produces a better traceback here:

Traceback (most recent call last):
  File "test.py", line 471, in <module>
    doc = parse_pronom_xml(f)
  File "test.py", line 377, in parse_pronom_xml
    ET.SubElement(fido_sig, 'note').text = get_text_tna(pronom_sig, 'SignatureNote').encode('UTF-8')
  File "lxml.etree.pyx", line 951, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:46353)
  File "apihelpers.pxi", line 695, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20953)
  File "apihelpers.pxi", line 683, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:20829)
  File "apihelpers.pxi", line 1393, in lxml.etree._utf8 (src/lxml/lxml.etree.c:27125)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

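The explicit `.encode` calls are what broke this. The resulting UTF-8 bytestrings contain non-ASCII bytes, and ElementTree treats them as plain `str` text to be encoded rather than as Unicode strings. On Python 2, calling `.encode` on a `str` first decodes it implicitly with the ASCII codec, which fails on the first non-ASCII byte (0xe2 in the traceback above).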
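
A minimal sketch of the failure mode and one possible way around it, assuming the note text is already a Unicode string. The element names and sample text are made up for illustration and are not taken from Fido's code. The explicit `decode("ascii")` stands in for the implicit decode that Python 2 performs, so the snippet behaves the same on Python 2 and 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical note text containing a non-ASCII character (U+2019).
note = u"Byte sequence \u2019quoted\u2019"
encoded = note.encode("utf-8")  # first non-ASCII byte is 0xe2

# Python 2 decodes a str as ASCII before re-encoding it; doing the same
# decode explicitly reproduces the error from the first traceback.
error = None
try:
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Possible fix: assign the Unicode string directly and let the
# serializer handle encoding.
root = ET.Element("signature")
ET.SubElement(root, "note").text = note
serialized = ET.tostring(root)  # non-ASCII characters become character references
```

Leaving the text as Unicode until serialization means the serializer can emit character references for anything outside the target encoding, so no implicit ASCII decode ever happens.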