CifCheck validation fails on files with non-ASCII characters

ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files

MIT License


CifCheck validation fails on files with non-ASCII characters #131

Closed aozalevsky closed 6 months ago

aozalevsky commented 6 months ago

Though the mmCIF standard allows UTF-8 and python-ihm supports it, PDB-DEV's internal mmCIF validation tool fails on non-ASCII characters. The most common case is the IMP reference, which is added to the output automatically and lists "Velázquez-Muriel J" as one of the authors.

The easiest fix would be to update the ihm.dumper.write docs with an explicit example that is compatible with PDB-DEV.

benmwebb commented 6 months ago

I can update the docs with an example. There are two options that are straightforward in Python: use ASCII encoding (and replace non-ASCII characters with ?) or use latin1/iso-8859-1 encoding, which would preserve most Western European accents at least. I've definitely seen latin1 files in PDB-Dev, so maybe CifCheck likes those.
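For reference, here is a rough sketch of what the two options do to the output bytes, using only the standard library. The `write_record` function is a hypothetical stand-in for a call such as `ihm.dumper.write(fh, [system])`, and the file names are made up. The point is only that the encoding is chosen when the file handle is opened, not by the writer itself.

```python
import os
import tempfile

AUTHOR = "Velázquez-Muriel J"


def write_record(fh):
    # Hypothetical stand-in for ihm.dumper.write(fh, [system]);
    # it just writes one line containing a non-ASCII author name.
    fh.write("_citation_author.name '%s'\n" % AUTHOR)


def dump(path, encoding, errors="strict"):
    # The encoding and error policy are set on the text file handle.
    with open(path, "w", encoding=encoding, errors=errors) as fh:
        write_record(fh)
    # Read back the raw bytes to see what actually landed on disk.
    with open(path, "rb") as fh:
        return fh.read()


tmpdir = tempfile.mkdtemp()

# Option 1: pure ASCII; characters that cannot be encoded become '?'.
ascii_bytes = dump(os.path.join(tmpdir, "ascii.cif"), "ascii",
                   errors="replace")

# Option 2: latin1 (iso-8859-1); Western European accents are kept,
# one byte per character.
latin1_bytes = dump(os.path.join(tmpdir, "latin1.cif"), "latin1")
```

With the default `errors="strict"`, the latin1 variant raises `UnicodeEncodeError` for any character outside Latin-1, so `errors="replace"` may be worth passing there too if the input is not known to be Western European.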