kienerj / pycdxml

Tools to automatically convert and process cdx and cdxml files in Python
GNU General Public License v3.0
35 stars 5 forks

Non-XML compatible control characters may be present in CDXString value when constructing from bytes #32

Closed rytheranderson closed 1 year ago

rytheranderson commented 1 year ago

If control characters are present, lxml raises a ValueError: "ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters". This happens here: https://github.com/kienerj/pycdxml/blob/0e33c97fd918be84b8e23e9d1a20cedc23770cda/pycdxml/cdxml_converter/chemdraw_objects.py#LL497C17-L497C17, and likely in other places where CDXString.str_value is inserted into XML elements.

I attached an example showing such a case, where the \x03 (^B) control character is present twice in a Keyword property.

I would suggest something like

value = value.replace("\r", "\n").translate(dict.fromkeys(range(7)))

This would go in the CDXString.from_bytes constructor, here. It converts \r to \n and removes \x00 through \x06. Other control characters (\t, \v, etc.) could be handled case by case, as \r is, if they turn out to be an issue.
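For anyone who wants to try the suggestion outside of pycdxml, here is a minimal standalone sketch. The helper name `strip_low_control_chars` is made up for illustration and is not part of the library; the idea is only that `dict.fromkeys(range(7))` builds a translation table mapping code points 0 through 6 to `None`, which `str.translate` treats as "delete".

```python
# Hypothetical helper; the name is illustrative and not part of pycdxml.
# dict.fromkeys(range(7)) maps code points 0..6 to None, and str.translate
# deletes every character whose table entry is None.
_LOW_CONTROL_CHARS = dict.fromkeys(range(7))


def strip_low_control_chars(value: str) -> str:
    """Turn carriage returns into newlines, then drop code points 0 through 6."""
    return value.replace("\r", "\n").translate(_LOW_CONTROL_CHARS)


cleaned = strip_low_control_chars("key\x03word\x03\rnext")
print(repr(cleaned))  # 'keyword\nnext'
```

Note that this only covers the first seven code points; other characters that XML 1.0 rejects (for example \x0b) would still get through.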

kienerj commented 1 year ago

Thanks for reporting. This test file actually also leads to other issues (e.g. UTF-8 text in the Keyword and Content), which are now fixed as well.

I'm using the solution from:

https://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python

to remove all invalid XML characters when needed.

re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
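As a rough illustration of how that expression behaves, here is a self-contained sketch. The function name `remove_invalid_xml_chars` is an assumption for this example and may not match what pycdxml actually uses; the pattern itself is the one quoted above, precompiled so it can be reused.

```python
import re

# Same character class as the expression above: anything outside the ranges
# allowed by XML 1.0 (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD and
# U+10000-U+10FFFF) is matched and removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def remove_invalid_xml_chars(text: str) -> str:
    """Hypothetical helper: strip characters that XML 1.0 does not allow."""
    return _INVALID_XML_CHARS.sub("", text)


print(repr(remove_invalid_xml_chars("key\x03word\x00\x0b\ttab\r\n")))
# 'keyword\ttab\r\n'
```

Unlike the translate-based suggestion, this keeps \r as it is and removes every control character that XML rejects, not just \x00 through \x06.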

Having said that, I suggest you remove the example file, as I suspect the original structure can be extracted from the annotations in this file (SMILES) and might be sensitive.