Error when attempting decode of non-utf8 bytes

The try/except here: https://github.com/kienerj/pycdxml/blob/6fd2b9cdaf4ca92e1a5f285be34b9899fcfe5538/pycdxml/cdxml_converter/chemdraw_types.py#L114C1-L123C60 attempts to decode a character that had already failed utf8 decoding again with utf8. I think this should be something like:

except LookupError:
    logger.warning("Found unsupported charset. Retrying with 'utf8'.")
    stream.seek(stream.tell() - text_length)
    value = stream.read(text_length).decode('utf8')
except UnicodeDecodeError:
    logger.warning("Found unsupported character. Retrying with 'utf8'.")
    stream.seek(stream.tell() - text_length)
    value = stream.read(text_length).decode('utf8', errors="replace")

This will add the Unicode replacement character U+FFFD (�) for invalid bytes: making it clear to the user that the byte was not decoded, but still allowing for "successful" document conversion.

The first error, LookupError happens when the cdxml contains a charset python doesn't support or is written in such a way it's not recognized. One can debate if it makes sense to retry with utf8 at all but in many cases that will then work just fine.

The UnicodeDecodeError simply gets triggered when a decoding issue happened. I had file examples for which the charset didn't really match the actual content and then simply retrying with utf8 instead of what the file says to use, solves the issue.

But you are right that there could be a case that the charset is already set to "utf8" and then you would just be repeating the same thing over again.

Anyway I fixed it this way:

     try:
        value = stream.read(text_length).decode(charset)
    except LookupError:
        logger.warning("Found unsupported charset. Retrying with 'utf8'.")
        stream.seek(stream.tell() - text_length)
        value = stream.read(text_length).decode('utf8')
    except UnicodeDecodeError:
        stream.seek(stream.tell() - text_length)
        if charset == 'utf8':
            logger.warning("Found unsupported character for utf8. Retrying with errors=='replace'.")
        else:
            logger.warning(f"Found unsupported character for charset {charset}. "
                           f"Retrying with 'utf8' and errors=='replace'.")
        value = stream.read(text_length).decode('utf8', errors="replace")

this seems the least strict with the most information

kienerj / pycdxml

Error when attempting decode of non-utf8 bytes #34