kienerj / pycdxml

Tools to automatically convert and proccess cdx and cdxml files in python
GNU General Public License v3.0
35 stars 5 forks source link

Error when attempting decode of non-utf8 bytes #34

Closed rytheranderson closed 1 year ago

rytheranderson commented 1 year ago

The try/except here: https://github.com/kienerj/pycdxml/blob/6fd2b9cdaf4ca92e1a5f285be34b9899fcfe5538/pycdxml/cdxml_converter/chemdraw_types.py#L114C1-L123C60 attempts to decode a character that had already failed utf8 decoding again with utf8. I think this should be something like:

except LookupError:
    logger.warning("Found unsupported charset. Retrying with 'utf8'.")
    stream.seek(stream.tell() - text_length)
    value = stream.read(text_length).decode('utf8')
except UnicodeDecodeError:
    logger.warning("Found unsupported character. Retrying with 'utf8'.")
    stream.seek(stream.tell() - text_length)
    value = stream.read(text_length).decode('utf8', errors="replace")

This will add the Unicode replacement character U+FFFD (�) for invalid bytes: making it clear to the user that the byte was not decoded, but still allowing for "successful" document conversion.

kienerj commented 1 year ago

The first error, LookupError happens when the cdxml contains a charset python doesn't support or is written in such a way it's not recognized. One can debate if it makes sense to retry with utf8 at all but in many cases that will then work just fine.

The UnicodeDecodeError simply gets triggered when a decoding issue happened. I had file examples for which the charset didn't really match the actual content and then simply retrying with utf8 instead of what the file says to use, solves the issue.

But you are right that there could be a case that the charset is already set to "utf8" and then you would just be repeating the same thing over again.

Anyway I fixed it this way:

     try:
        value = stream.read(text_length).decode(charset)
    except LookupError:
        logger.warning("Found unsupported charset. Retrying with 'utf8'.")
        stream.seek(stream.tell() - text_length)
        value = stream.read(text_length).decode('utf8')
    except UnicodeDecodeError:
        stream.seek(stream.tell() - text_length)
        if charset == 'utf8':
            logger.warning("Found unsupported character for utf8. Retrying with errors=='replace'.")
        else:
            logger.warning(f"Found unsupported character for charset {charset}. "
                           f"Retrying with 'utf8' and errors=='replace'.")
        value = stream.read(text_length).decode('utf8', errors="replace")

this seems the least strict with the most information