Closed mistydemeo closed 7 years ago
This is possibly related to #83, but doesn't go through that codepath.
I converted this to use lxml for debugging, and it turns out lxml produces a better traceback here:
Traceback (most recent call last):
  File "test.py", line 471, in <module>
    doc = parse_pronom_xml(f)
  File "test.py", line 377, in parse_pronom_xml
    ET.SubElement(fido_sig, 'note').text = get_text_tna(pronom_sig, 'SignatureNote').encode('UTF-8')
  File "lxml.etree.pyx", line 951, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:46353)
  File "apihelpers.pxi", line 695, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20953)
  File "apihelpers.pxi", line 683, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:20829)
  File "apihelpers.pxi", line 1393, in lxml.etree._utf8 (src/lxml/lxml.etree.c:27125)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The explicit .encode calls broke this: the resulting UTF-8-encoded bytestrings contain non-ASCII characters, and ElementTree treats them as byte strings to be converted rather than as Unicode strings.
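For illustration, here is a minimal sketch of the likely fix: assign Unicode text to the element and let the serializer do the encoding once, at write time. This is not the actual patch. The `build_signature` helper, the `signature` element name, and the sample note text are hypothetical; only `fido_sig` and the `note` element come from the traceback above.

```python
import xml.etree.ElementTree as ET


def build_signature(note_text):
    """Build a signature element (hypothetical helper for this sketch)."""
    # 'signature' is an assumed element name; 'note' is taken from the traceback.
    fido_sig = ET.Element('signature')
    # Assign the Unicode string directly, with no .encode('UTF-8') here.
    ET.SubElement(fido_sig, 'note').text = note_text
    return fido_sig


# Sample text containing a non-ASCII character (made up for this example).
sig = build_signature(u'Caf\u00e9 signature note')

# Encoding happens exactly once, when the tree is serialized.
xml_bytes = ET.tostring(sig, encoding='UTF-8')
```

The same idea should apply with lxml, which also accepts Unicode strings for `.text` and encodes on output.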
When trying to update to PRONOM 89, I encountered an exception being raised by the ElementTree serializer when trying to write Fido's reformatted XML documents. In the original context this happens when we serialize a single large XML document. To isolate the bug I adapted the script to write out each individual converted PRONOM record, and confirmed that
Sample traceback: