Closed rytheranderson closed 1 year ago
Thanks for reporting. This test file also leads to other issues (e.g., UTF-8 text in the Keyword and Content), which are now fixed as well.
I'm using the solution from https://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python to remove, where needed, all invalid XML characters:
re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
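For readers who want to try this in isolation, here is a minimal, self-contained sketch of the same regex wrapped in a helper. The function name `strip_invalid_xml_chars` and the sample string are assumptions made for illustration; they are not taken from pycdxml.

```python
import re

# Character class from the line above: anything outside the ranges that
# XML 1.0 allows (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF) is matched and removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with every XML-incompatible character removed (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


# Illustrative input: two control characters inside otherwise valid text.
cleaned = strip_invalid_xml_chars("Key\x03word\x00 with\ttab")
print(repr(cleaned))
```

Tab, line feed and carriage return are kept, since XML allows them; other C0 control characters such as `\x0b` are dropped.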
Having said that, I suggest you remove the example file, as I suspect the original structure can be extracted from the annotations in this file (SMILES) and might be sensitive.
If control characters are present, the result is a ValueError from lxml:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
here: https://github.com/kienerj/pycdxml/blob/0e33c97fd918be84b8e23e9d1a20cedc23770cda/pycdxml/cdxml_converter/chemdraw_objects.py#LL497C17-L497C17, and likely in other places where CDXString.str_value is inserted into XML elements.
I attached an example showing such a case, where the \x03 (^B) control character is present twice in a Keyword property.
I would suggest something like
In the CDXString.from_bytes constructor, here. This will replace \x00 through \x06 with nothing. Other control characters (\t, \v, etc.) could be replaced on a case-by-case basis, like \r (if they are an issue).
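The snippet from the original suggestion is not reproduced in this thread. As a rough sketch of the idea described (dropping `\x00` through `\x06` and leaving other characters alone), one possibility uses `str.translate`. The helper name `strip_low_control_chars` is hypothetical, and how it would be wired into `CDXString.from_bytes` is an assumption, not shown here.

```python
# Map code points 0x00..0x06 to None so that str.translate deletes them.
_LOW_CONTROL_CHARS = dict.fromkeys(range(0x00, 0x07))


def strip_low_control_chars(value: str) -> str:
    """Remove \\x00 through \\x06 from value (hypothetical helper, not pycdxml API)."""
    return value.translate(_LOW_CONTROL_CHARS)


# Illustrative input mimicking a Keyword property containing \x03 twice.
keyword = strip_low_control_chars("Key\x03word\x03")
print(repr(keyword))
```

Characters such as `\t` or `\v` pass through unchanged in this sketch; they could be added to the mapping case by case if they turn out to be a problem.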