monarch-initiative / omim

Data ingest pipeline for OMIM.

Bug: Symbols not always uppercase #126

Closed by joeflack4 1 month ago

joeflack4 commented 2 months ago

Overview

While working on #119, I noticed that some of the mondo#abbreviation values we're adding (these are actually symbols) are lowercase.

This is not expected. mondo#abbreviations should all be uppercase. I would also expect symbols to always be uppercase, as I learned from Trish, and a cursory search seems to corroborate that.

The source of this bug in the code appears to be cleanup_label() and _detect_abbreviations().

I think either (a) there are bugs in these functions, or (b) these functions simply should not be applied to symbols, at least not the ones in "Preferred title; symbol", and are only meant to be used on other labels or parts of labels.

Examples

OMIM:126370 - hs3

OMIM:126370 a owl:Class ;
    rdfs:label "dna, satellite, 3" ;
    MONDO:exclusionReason MONDO:excludeTrait ;
    oboInOwl:hasExactSynonym "D1Z1",
        "dna, satellite, 3",
        "hs3" .


twhetzel commented 2 months ago

@joeflack4 the original content from OMIM should be parsed so that what is labeled as a symbol by OMIM becomes tagged as an abbreviation in the omim.owl file that is created. The capitalization here (of the OMIM content) does not matter for tagging something as an abbreviation.

joeflack4 commented 2 months ago

I'm fairly confident that this is a bug in the code, based on what I've seen and on the logic itself. I think I know where it is, and it may only happen when there are multiple symbols.

But I think it does matter, because this will make its way into Mondo. Otherwise we will have things marked as abbreviations that do not pass our normal criteria; that is, they will be lowercase.

Unless I am misunderstanding the importance of this criterion. Is it important that we keep it consistent, that we make sure 100% of our synonyms marked as abbreviations are indeed uppercase, etc.?

If that's not important, then we can close this issue.

If we do want to reach 100%, some of this can be dealt with by making sure the pipelines are correct, but I'm sure some of this will be on the curation side as well, assuming there are currently some synonyms in Mondo already that have this problem.
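To get a sense of how far we are from 100%, a check along these lines could flag the offenders. This is only a rough sketch: `find_non_uppercase` is a hypothetical helper, not something that exists in the repo, and it assumes the synonyms marked as abbreviations have already been collected into a list of strings.

```python
from typing import Iterable, List


def find_non_uppercase(abbreviations: Iterable[str]) -> List[str]:
    """Return the abbreviations that contain at least one lowercase letter.

    Hypothetical helper for illustration; a string is flagged when it
    differs from its own uppercased form.
    """
    return [a for a in abbreviations if a != a.upper()]


# The two symbol-like synonyms from the OMIM:126370 example above.
flagged = find_non_uppercase(["D1Z1", "hs3"])
print(flagged)  # ['hs3']
```

Something like this could run on the pipeline output and, separately, on the abbreviation synonyms already in Mondo, which would show how much is a pipeline problem and how much is left for curation.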

twhetzel commented 1 month ago

In mimTitles.txt, the "symbol" value should be parsed out of the OMIM column "Preferred Title; symbol", and its capitalization should be left exactly as it is found in the file.
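As a rough illustration of that parsing, here is a minimal sketch. The function name `split_title_and_symbols` is hypothetical, and it assumes the field is laid out as `title; symbol`, with any further `;`-separated parts also treated as symbols. The sample value is modeled on the OMIM:126370 example above, not copied from mimTitles.txt.

```python
from typing import List, Tuple


def split_title_and_symbols(field: str) -> Tuple[str, List[str]]:
    """Split a 'Preferred Title; symbol' value into (title, symbols).

    Hypothetical helper: only surrounding whitespace is stripped;
    the capitalization of every part is left untouched.
    """
    parts = [p.strip() for p in field.split(";")]
    title = parts[0]
    # Anything after the first ';' is treated as a symbol (assumption).
    symbols = [p for p in parts[1:] if p]
    return title, symbols


title, symbols = split_title_and_symbols("DNA, SATELLITE, 3; HS3")
print(title)    # DNA, SATELLITE, 3
print(symbols)  # ['HS3']
```

Any case-changing cleanup would then be applied to `title` only, and `symbols` would be written out as-is.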

joeflack4 commented 1 month ago

That sounds correct to me as well.

I'll add this bug fix to #128 then, or I can make a separate PR for it first if you want.

Going with option (b): not using these case-changing functions on these symbols at all.

joeflack4 commented 1 month ago

resolved by #130