ncbo / umls2rdf

These Python scripts connect to the Unified Medical Language System (UMLS) database and translate the ontologies into RDF/OWL files. This is part of the BioPortal project.
http://bioportal.bioontology.org/

E with acute accent generated incorrectly in ICD10CM TTL file #45

Open jvendetti opened 1 month ago

jvendetti commented 1 month ago

From a user on the BioPortal support list:

In ICD10CM the é characters are displaying incorrectly e.g.: https://bioportal.bioontology.org/ontologies/ICD10CM?p=classes&conceptid=C84.1

This is a link to the class in the UMLS Metathesaurus browser:

https://uts.nlm.nih.gov/uts/umls/vocabulary/ICD10CM/C84.1

... where the name of the class is listed as "Sézary disease".

This is the same class in the ICD10CM TTL file generated by this program:

<http://purl.bioontology.org/ontology/ICD10CM/C84.1> a owl:Class ;
    skos:prefLabel """Sézary disease"""@en ;
    skos:notation """C84.1"""^^xsd:string ;
    rdfs:subClassOf <http://purl.bioontology.org/ontology/ICD10CM/C84> ;
    <http://purl.bioontology.org/ontology/ICD10CM/ORDER_NO> """02478"""^^xsd:string ;
    umls:cui """C0036920"""^^xsd:string ;
    umls:tui """T191"""^^xsd:string ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T191> ;

The e with acute accent in the skos:prefLabel is incorrect.
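
The "Ã©" pair is what the two UTF-8 bytes of "é" look like when they are decoded as Latin-1 (or Windows-1252). The sketch below reproduces that pattern with the label from this class. It only illustrates the symptom; it is not taken from the umls2rdf code and does not show where in the pipeline the wrong decoding happens.

```python
# The label as shown in the UMLS browser (U+00E9 is "e with acute accent").
correct = "S\u00e9zary disease"

# Encode as UTF-8, then decode those bytes with the wrong codec.
# "é" becomes the bytes C3 A9, which Latin-1 reads as two characters.
garbled = correct.encode("utf-8").decode("latin-1")
print(ascii(garbled))  # 'S\xc3\xa9zary disease'

# Because Latin-1 maps every byte to one character, the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == correct)  # True
```

If the string in the TTL file matches `garbled`, the text was valid UTF-8 at some point and was re-read with a single-byte encoding before being written out.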

justin2004 commented 1 month ago

@jvendetti

have you enabled UTF-8 in your environment? e.g. in a debian based distro: https://github.com/SPARQL-Anything/sparql.anything/blob/v1.0-DEV/Dockerfile.development#L11-L16

jvendetti commented 1 month ago

@alexskr - what machine are you using when you generate the UMLS TTL files? Are you able to respond to the previous comment from Justin?

alexskr commented 1 month ago

we set it to en_US.UTF-8 on the os level

justin2004 commented 1 month ago

@alexskr just wanted to note that on a linux distro it isn't enough to set the environment variables. you also need to have the locales installed. the dockerfile i referenced does that.
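
A quick way to see what Python itself picks up from the environment is sketched below. It assumes Python 3, and the function names are made up for illustration; they are not part of umls2rdf. If the locale is missing, the preferred encoding can fall back to something other than UTF-8 even when the variables are set.

```python
import locale
import sys


def encoding_report():
    """Collect the default encodings Python sees in this environment."""
    return {
        # Default used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal or a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


def write_label(path, label):
    """Write text as UTF-8 regardless of the locale settings."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(label)


if __name__ == "__main__":
    print(encoding_report())
```

Passing `encoding="utf-8"` explicitly, as in `write_label`, makes the output independent of the machine's locale. Whether the script's file writes already do this is something to check in the code.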