kienerj / pycdxml

Tools to automatically convert and proccess cdx and cdxml files in python
GNU General Public License v3.0
35 stars 5 forks source link

Error when reading CDX with enhanced stereochemistry #30

Closed rytheranderson closed 1 year ago

rytheranderson commented 1 year ago

When reading CDX files with enhanced stereochemistry annotations I get an AttributeError originating on this line: https://github.com/kienerj/pycdxml/blob/3a240aafb9f61b56aea5344e9d481b4ed965599a/pycdxml/cdxml_converter/chemdraw_types.py#L139

From what I have found, font is None because (x for x in fonttable.fonts if x.id == font_id) is empty. It is empty because no font IDs in fonttable.fonts matches the first font style ID. I'm having difficulty debugging further due to lack of knowledge.

There are two CDX examples in the attached zip, here is some minimal code that reproduces the issue using those examples:

from pycdxml.cdxml_converter import ChemDrawDocument

for test in ["no_enhanced_stereo.cdx", "enhanced_stereo.cdx"]:
    with open(test, "rb") as file:
        chemdraw = ChemDrawDocument.from_bytes(file)
        chemdraw.to_cdxml()
        print(f"No errors when converting {test}")

cdx_examples.zip

kienerj commented 1 year ago

Thanks especially also for submitting the files. Where the files created directly in ChemDraw 21 or manipulated by some other application?

Doing the missing null check

if font is None:
    logger.warning(f"Found font with invalid font id {font_id}. Using default charset {CDXString.DEFAULT_CHARSET}.")
    return CDXString.DEFAULT_CHARSET
charset_id = font.charset

Removes this specific error.

However a new issue appears in that enhanced stereochemistry label is displayed twice (after conversion to cdxml).

Need to further look into that but hence why I ask source of cdx because it seems "buggy" but sometimes the format changes or the documentation is missing/incomplete.

Note for myself: new property UTF8Text -> seems to be for cdx only, do not convert to cdxml as ChemDraw doesn't do so either objecttag objects name property is empty but should be "enhancedstereo" (this cause duplicate display). Check if name property is also empty in cdx already?

rytheranderson commented 1 year ago

Thanks for the info! These files were created directly in ChemDraw 21 with no further manipulation (that I am aware of, I just drew the examples and "saved as" CDX). I use ACS 1996 Document style sheet as default, but I did see the same issue with various styles.

kienerj commented 1 year ago

OK thanks for the info.

this is a typical pain with ChemDraw (and hence cdx). that chemdraw itself often ignores the format specification but due to that also can deal with such issues.

In this case per official specification here a string should indicated how many style it contains with a uint16 (eg 2 bytes). However in your file in at least one instance it only uses 1 byte and hence everything else read afterwards is wrong and that is also why the issue with the missing font happens as everything is one byte off...

Note for myself:

in CDXString figure out how to deal with 0 styles that only use 1 byte to say "no styles".

rytheranderson commented 1 year ago

Interesting, thanks for sharing the official specification. The off-by-1-byte issue makes sense broadly, although I'm sure I'm missing key details. I see the duplicated stereo annotations when pasting the CDXML back into ChemDraw, however, it still converts to the correct V3000 (out of ChemDraw), and the underlying structure is correct. Do you think there is a path to a solution for this? Can I help in any way?

kienerj commented 1 year ago

In the converted cdxml, after the hacky fix with the missing font, you will have this:

<objecttag TagType="Unknown" Name=""><t id="0" p="270.59 284.72" BoundingBox="270.59 278.37 279.76 284.82" 
CaptionLineHeight="variable" LineHeight="variable"><s font="3" size="7.5" face="0" color="0">&amp;1</s></t>
</objecttag>

The important part is the empty Name="" which must actually be Name="enhancedstereo"

Then the label will be correctly displayed only once. So yes it's only a display issue. conversion in ChemDraw to other formats will contain correct info.

To no surprise it's the "enhancedstereo" that has the issue with the missing byte so that's probably why it's empty and then triggers the double display of the enhanced stereo symbol.

So the solution is to figure out how to properly deal with this 1 byte issue. Not clear yet how to fix this due to at the spot the reading happens there os no context from where the string is coming.

@staticmethod
def from_bytes(property_bytes: bytes, charset='iso-8859-1', fonttable=None) -> 'CDXString':

    stream = io.BytesIO(property_bytes)
    style_runs = int.from_bytes(stream.read(2), "little")
    font_styles = []
    style_starts = []
    for idx in range(style_runs):

See the second line of this method? here I would need to determine if it's 1 or 2 bytes. but with complete lack of content unclear. Throw exception when wrong font id is found and handle that by rereading with 1 byte?

It could also just be a ChemDraw 21 bug as I'm still on 19 and haven't encountered this before.

rytheranderson commented 1 year ago

Alright, that makes things a lot clearer, I will continue to experiment and see if I can propose a solution.

kienerj commented 1 year ago

Can you try with latest commit? that fixes it for me

in fact I was wrong and it is a 2 byte offset as the "number of styles" is entirely omitted.

Solution: raise exception in case of missing font (eg invalid font id), catch it an reread bytes taking above consideration into account. This now generates a visually correct cdxml file.

EDIT: and potentially if possible if you have a set of files to share from ChemDraw 21 or try them yourself and see if they work might also help to find additional issues especially regarding ChemDraw 21.

rytheranderson commented 1 year ago

I can confirm that the examples work with the latest commit, thanks! I will prepare a more comprehensive set of files with ChemDraw 21 and share here soon.

rytheranderson commented 1 year ago

Here are a few more examples from ChemDraw 21: chemdraw21_examples.zip

kienerj commented 1 year ago

thanks. I'm closing this now as the core issue is solved. above files all convert ok.