monarch-initiative / omim

Data ingest pipeline for OMIM.

Multiple abbreviations not handled correctly #125

Closed: joeflack4 closed this issue 5 days ago

joeflack4 commented 2 weeks ago

Overview

While working on #119, I noticed that the code for handling abbreviations does not correctly handle entries that have more than one abbreviation. There are not many cases of multiple symbols / abbreviations; in fact, I've found just 2:

OMIM:126370: DNA, SATELLITE, III; HS3; D1Z1
OMIM:171820: PHOSPHATASE, SALIVARY ACID, A; SACP; ACPS

I'm not sure how many problems the improper handling of multiple abbreviations is causing, but there appear to be at least some.
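To illustrate the shape of the data, here is a minimal sketch of splitting a title such as the two above into a label plus a list of abbreviations. It assumes the title is a semicolon-delimited string whose first field is the label and whose remaining fields are all abbreviations. `split_title` is a hypothetical helper written for this example, not a function in omim2obo.

```python
from typing import List, Tuple


def split_title(title: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: split 'LABEL; ABBR1; ABBR2' into (label, [abbr, ...]).

    Assumption: the first semicolon-delimited field is the label and every
    following field is an abbreviation / symbol.
    """
    # Strip whitespace around each field and drop empty fields
    # (e.g. those produced by a trailing semicolon).
    parts = [part.strip() for part in title.split(';')]
    parts = [part for part in parts if part]
    if not parts:
        return '', []
    return parts[0], parts[1:]


label, abbreviations = split_title('DNA, SATELLITE, III; HS3; D1Z1')
print(label)          # the label field
print(abbreviations)  # all abbreviations, not just the first
```

Returning a list even when there is zero or one abbreviation means callers never need a special case for the multi-abbreviation entries.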

Issue 1: rdfs:label

The code adds an rdfs:label for only one of the abbreviations. Given the way the code works (i.e. setting the abbreviation as the primary label for genes), this apparently only affects OMIM:171820.

Issue 2: Synonyms & "modified included label"

There should be an outer for loop here over the list of abbreviations. The code currently creates 3 triples for synonyms, and another one for a "modified included label" (not sure why we're creating this): https://github.com/monarch-initiative/omim/blob/dc3c79a5606a495cd7a08623ed4ac17c234d0575/omim2obo/main.py#L207-L215

I don't think there should be both abbrev and abbr. There should just be a single abbreviations list.
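A rough sketch of the outer loop idea, assuming a single `abbreviations` list and one synonym triple per abbreviation. This is not the code at the link above: triples are modeled as plain tuples of strings, `synonym_triples` is a hypothetical function, and the choice of `oboInOwl:hasExactSynonym` as the predicate is an assumption made for illustration.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Assumption: the predicate used for abbreviation synonyms. The real
# pipeline may emit different or additional predicates.
HAS_EXACT_SYNONYM = 'oboInOwl:hasExactSynonym'


def synonym_triples(omim_id: str, abbreviations: Iterable[str]) -> List[Triple]:
    """Hypothetical sketch: build one synonym triple per abbreviation."""
    triples: List[Triple] = []
    # Outer loop: every abbreviation gets the same treatment, so no
    # abbreviation is silently dropped when an entry has more than one.
    for abbreviation in abbreviations:
        triples.append((omim_id, HAS_EXACT_SYNONYM, abbreviation))
    return triples


for triple in synonym_triples('OMIM:171820', ['SACP', 'ACPS']):
    print(triple)
```

Any other per-abbreviation triples would go inside the same loop body, which keeps a single code path for entries with one abbreviation and entries with several.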

Questions

joeflack4 commented 2 weeks ago

Notes to self: There shouldn't be exact_labels. I suppose "exact" is referring to exactSynonym? There is already pref_label (also renamed to label). We should just use that and extract all abbreviations from it at once.

joeflack4 commented 5 days ago

Resolved by #130.