ihmwg / python-ihm

Python package for handling IHM mmCIF and BinaryCIF files
MIT License

Remove UTF characters #77

Closed bienchen closed 2 years ago

bienchen commented 2 years ago

mmCIF can do UTF8 but it's discouraged. Basically, lots of tools in the mmCIF universe can deal with UTF8, but the RCSB validation tool can not. Therefore I changed Žídek A in ihm.citations.alphafold2 to Zidek A, which is how this author's name is spelled in other publications.
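For anyone who wants to check their own files for the same problem, here is a minimal sketch of how non-ASCII characters could be located and folded to plain ASCII. It uses only the Python standard library; `find_non_ascii` and `fold_to_ascii` are hypothetical helper names for illustration and are not part of python-ihm.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character) for every non-ASCII character.

    Hypothetical helper, not part of python-ihm. Lines and columns are
    1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits


def fold_to_ascii(text):
    """Replace accented letters with their unaccented base letters.

    NFKD splits e.g. 'Ž' into 'Z' plus a combining caron; encoding to
    ASCII with errors="ignore" then drops the combining mark.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(find_non_ascii("Žídek A"))
print(fold_to_ascii("Žídek A"))  # prints: Zidek A
```

Note that characters without a Unicode decomposition (for example "ø" or "ß") are silently dropped by this approach, so the result should be reviewed by hand.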

benmwebb commented 2 years ago

> mmCIF can do UTF8 but it's discouraged

Really, by whom? Both @brindakv and John W have in the past at least strongly hinted that UTF-8 is the most appropriate encoding for mmCIF (and it is mandated for BinaryCIF). Many (perhaps most) PDB-Dev depositions are not plain ASCII either; they are either UTF-8 or latin1/iso-8859-1.

> the RCSB validation tool can not [handle UTF-8]

@brindakv, can this be fixed? Is it going to be?

I am happy to merge this but there are other citations (e.g. imp, hhpred) that are UTF-8, which have been "working" for some time without issues.

bienchen commented 2 years ago

Just noticed this one: https://www.iucr.org/resources/cif/spec/version1.1/semantics#markup So in theory, by the CIF standard, it's all ASCII, but there is a markup extension (looks a bit LaTeX-like to me) for all kinds of special letters.
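To make the markup idea concrete, here is a rough sketch of how an accented name might be written using those codes. The accent table below is only a subset, filled in as I read the linked page (for example `\'` for acute and `\<` for hacek, placed before the letter), so please verify the codes against the spec before relying on them. `to_cif_markup` is a hypothetical helper, not a python-ihm function.

```python
import unicodedata

# Unicode combining mark -> CIF 1.1 accent code.
# Assumption: subset of the table on the linked IUCr page; verify before use.
CIF_ACCENTS = {
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
    "\u0302": "\\^",  # circumflex
    "\u0308": '\\"',  # umlaut
    "\u0303": "\\~",  # tilde
    "\u0327": "\\,",  # cedilla
    "\u030c": "\\<",  # hacek (caron)
}


def to_cif_markup(text):
    """Rewrite accented letters as <accent code><base letter>.

    Hypothetical helper. Raises ValueError for any non-ASCII character
    that is not a single ASCII letter plus one accent from CIF_ACCENTS.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # NFD splits a precomposed letter into base letter + combining mark
        parts = unicodedata.normalize("NFD", ch)
        if (len(parts) == 2 and ord(parts[0]) < 128
                and parts[1] in CIF_ACCENTS):
            out.append(CIF_ACCENTS[parts[1]] + parts[0])
        else:
            raise ValueError("no CIF markup known for %r" % ch)
    return "".join(out)


print(to_cif_markup("Žídek A"))  # prints: \<Z\'idek A
```

Whether downstream tools (including the RCSB validator) actually interpret this markup is a separate question that this sketch does not answer.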