ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

UnicodeDecodeError while reading Entry 44 using ihm.reader #59

Closed saijananiganesan closed 3 years ago

saijananiganesan commented 3 years ago

I don't see anything unusual in the file, so I'm not sure why I am getting this error.

Exact line in code:

        with open(self.mmcif_file) as fh:
            self.system, = ihm.reader.read(fh, model_class=self.model)

Error:

    XX/ihm/reader.py", line 3173, in read
        more_data = r.read_file()
    XX/ihm/format.py", line 566, in read_file
        return self._read_file_c()
    XX/hm/format.py", line 616, in _read_file_c
        eof, more_data = _format.ihm_read_file(self._c_format)
    XX/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 24322: invalid start byte

benmwebb commented 3 years ago

Short answer: add , encoding='latin1' to your open call.
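
A minimal sketch of that change, using only the standard library so it runs without python-ihm installed. The temporary file and its one-line contents are made up for illustration and are not taken from Entry 44. In the original snippet, the same `fh` would be passed on to `ihm.reader.read`.

```python
import os
import tempfile

# Hypothetical stand-in for the problem file: ASCII text plus a single
# 0xb0 byte, which is not valid UTF-8 on its own.
data = b"_flr_inst_setting.details '25 \xb0C'\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".cif") as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Reading as UTF-8 reproduces the kind of error in the traceback.
    # (open() without an encoding uses the locale default, which was
    # evidently UTF-8 for the reporter; it is spelled out here.)
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # The suggested fix: latin1 assigns a character to every byte value.
    with open(path, encoding="latin1") as fh:
        # In the original code, fh would go to ihm.reader.read(fh, ...) here.
        text = fh.read()
finally:
    os.remove(path)
```

With this made-up input, the UTF-8 read raises `UnicodeDecodeError` while the latin1 read succeeds and yields a degree sign in `text`.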

Long answer: mmCIF files are supposed to be ASCII (7-bit). All ASCII files are also valid UTF-8 by construction. But this file contains non-ASCII (8-bit) characters (the error refers to a degree symbol in _flr_inst_setting.details) which are not valid UTF-8. There is no easy way to determine programmatically and unambiguously what the encoding is supposed to be, but if you don't care about these symbols (I'm guessing you don't) latin1 (or ISO-8859-1) is also a superset of ASCII and will accept any 8-bit character (it might not match what the original author intended though). Alternatively, you can open the file in binary mode, which will be handled by python-ihm in the same way, as if latin1-encoded.
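
The encoding relationships described above can be checked directly in Python. This is an illustrative sketch, not part of python-ihm, and the sample byte strings are made up.

```python
# ASCII bytes decode identically under ASCII, UTF-8 and latin1.
ascii_bytes = b"data_example"
same_everywhere = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("utf-8")
    == ascii_bytes.decode("latin1")
)

# A lone 0xb0 byte is a UTF-8 continuation byte, so it cannot start a
# character. That is the "invalid start byte" in the error message.
try:
    b"\xb0".decode("utf-8")
    lone_b0_is_utf8 = True
except UnicodeDecodeError:
    lone_b0_is_utf8 = False

# latin1 maps byte value N to code point N, so all 256 byte values decode
# and the round trip is lossless.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin1").encode("latin1") == all_bytes

# In latin1, 0xb0 is the degree sign.
degree = b"\xb0".decode("latin1")

# latin1 is still only a guess at the author's intent. Had the file been
# saved as UTF-8, the degree sign would be the two bytes 0xc2 0xb0, and
# decoding those as latin1 yields two characters instead of one.
mojibake = "\u00b0".encode("utf-8").decode("latin1")
```

The last line is the caveat in practice: latin1 never fails, but it can silently produce the wrong characters when the bytes were written in some other encoding.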

saijananiganesan commented 3 years ago

Thanks Ben!