Non-XML compatible control characters may be present in CDXString value when constructing from bytes

kienerj / pycdxml

Tools to automatically convert and process cdx and cdxml files in Python

GNU General Public License v3.0

If control characters are present, lxml raises a ValueError: "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters". This happens here: https://github.com/kienerj/pycdxml/blob/0e33c97fd918be84b8e23e9d1a20cedc23770cda/pycdxml/cdxml_converter/chemdraw_objects.py#LL497C17-L497C17, and likely in other places where CDXString.str_value is inserted into XML elements.
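
To see which characters in a string are likely to trigger this error before the value reaches lxml, a rough pre-check could look like the sketch below. The helper name find_xml_incompatible is hypothetical (it is not part of pycdxml or lxml), and the character class is taken from the XML 1.0 rule that only \t, \n and \r are allowed below \x20; treating that as what lxml rejects is an assumption, not a copy of its actual check.

```python
import re

# C0 control characters that XML 1.0 does not allow in character data.
# \t (0x09), \n (0x0A) and \r (0x0D) are deliberately left out of the class.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_xml_incompatible(value):
    """Return a list of (index, character) pairs for each offending character."""
    return [(m.start(), m.group()) for m in _CONTROL_CHARS.finditer(value)]


# Hypothetical sample resembling the Keyword property described in this issue.
sample = "key\x03word\x03"
print(find_xml_incompatible(sample))  # [(3, '\x03'), (8, '\x03')]
```

An empty list means no such control characters were found; it does not guarantee that lxml will accept the string for other reasons.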

I have attached an example showing such a case, where the \x03 (^C) control character is present twice in a Keyword property.

I would suggest something like

value = value.replace("\r", "\n").translate(dict.fromkeys(range(7)))

in the CDXString.from_bytes constructor, which would remove \x00 through \x06 entirely. Other control characters (\t, \v, etc.) could be replaced on a case-by-case basis, as is done for \r, if they turn out to be an issue.
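
As a standalone illustration of the suggestion above, here is a minimal sketch of a slightly broader variant. The function sanitize_cdx_string is hypothetical and is not the actual CDXString.from_bytes code. It also goes further than range(7) by deleting every control character below \x20 except \t, \n and \r; that wider range (which includes \v) is an assumption and may be more aggressive than the project wants.

```python
# Build a translation table that maps each unwanted control character to None,
# which makes str.translate() delete it. Tab, newline and carriage return are
# kept out of the table; carriage returns are handled separately below.
_DELETE_CONTROLS = dict.fromkeys(
    c for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)
)


def sanitize_cdx_string(value):
    """Convert carriage returns to newlines, then drop other control characters."""
    return value.replace("\r", "\n").translate(_DELETE_CONTROLS)


# Hypothetical input with two \x03 characters and a carriage return.
print(repr(sanitize_cdx_string("key\x03word\x03\rnext")))  # 'keyword\nnext'
```

Using a prebuilt table with str.translate keeps the cleanup to a single pass over the string, and the set of deleted characters can be narrowed back to range(7) by changing only the table.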

Non-XML compatible control characters may be present in CDXString value when constructing from bytes #32