monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

omim.ttl has multiple labels on classes #969

Open matentzn opened 3 years ago

matentzn commented 3 years ago

omim.ttl has classes which have multiple rdfs:label annotations, such as:

ORPHA:118231 a owl:Class ;
    rdfs:label "Capillary malformation-arteriovenous malformation",
        "Parkes Weber syndrome" ;
    biolink:category biolink:Disease .

By my regex count (rdfs:label.*,$), there are about 550 such cases. Having multiple labels is problematic for many reasons, including potential non-determinism in what the UI presents to the user. It is also (and this is my problem) not OBO-conformant, which means OBO parsers that are set to strict will fail when they encounter these cases. Could we perhaps select one label (for example, the first in alphabetical order) as rdfs:label and store the remaining labels as exactSynonym? I don't know what the general practice is here; happy to learn how things go.
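A minimal sketch of that selection rule, in plain Python over (subject, predicate, object) tuples rather than dipper's actual graph code. The function name `normalize_labels` is hypothetical, and using `oboInOwl:hasExactSynonym` as the synonym predicate is an assumption about what "exactSynonym" would map to:

```python
from collections import defaultdict

RDFS_LABEL = "rdfs:label"
# Assumption: the remaining labels are stored under this predicate.
EXACT_SYNONYM = "oboInOwl:hasExactSynonym"


def normalize_labels(triples):
    """Keep one rdfs:label per subject; turn the others into exact synonyms.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    The label that sorts first alphabetically is kept as the rdfs:label.
    """
    triples = list(triples)

    # Collect every distinct label string seen for each subject.
    labels = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDFS_LABEL:
            labels[subject].add(obj)

    # Start from all non-label triples, then re-add labels deterministically.
    result = [t for t in triples if t[1] != RDFS_LABEL]
    for subject in sorted(labels):
        first, *rest = sorted(labels[subject])
        result.append((subject, RDFS_LABEL, first))
        result.extend((subject, EXACT_SYNONYM, other) for other in rest)
    return result


# The class from the example above, with labels deliberately out of order.
triples = [
    ("ORPHA:118231", "rdf:type", "owl:Class"),
    ("ORPHA:118231", "rdfs:label", "Parkes Weber syndrome"),
    ("ORPHA:118231", "rdfs:label",
     "Capillary malformation-arteriovenous malformation"),
    ("ORPHA:118231", "biolink:category", "biolink:Disease"),
]
normalized = normalize_labels(triples)
```

Sorting makes the choice reproducible across runs, but it says nothing about which label is the preferred name in the source; a real fix might need to take that from Orphanet instead.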

matentzn commented 3 years ago

I used: https://archive.monarchinitiative.org/alpha/rdf/omim.ttl

TomConlin commented 3 years ago

Not convinced the OMIM ingest is the correct place to attribute labels to ORPHA classes in any case.

Is there any background on why this was done in the first place? Maybe to make reading the output more "friendly" for humans?

kshefchek commented 3 years ago

I suspect Orphanet's recent changes have broken OMIM's own ingest of their data. Go to https://www.omim.org/entry/268800 and click Clinical Resources -> Orphanet on the sidebar: there are 3 labels going to the same resource, and there's a disease (Sandhoff disease) that links to ORPHA:None. This is also in their API, so it affects our ingest.

But as @TomConlin is saying, we could remove labels altogether to avoid this.