monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

mapping: label only partially lex-match #221

Open sabrinatoro opened 1 year ago

sabrinatoro commented 1 year ago

From https://github.com/monarch-initiative/mondo-ingest/blob/main/src/ontology/lexmatch/unmapped_ncit_lex.tsv

| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | mapping_tool | confidence | subject_match_field | object_match_field | match_string | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MONDO:0005571 | polycythemia | MONDO:equivalentTo | NCIT:C27794 | Polycythemia (Excluding Polycythemia Vera) | semapv:LexicalMatching | oaklib | 0.8497788952 | rdfs:label | rdfs:label | polycythemia | LEXMATCH |

The lex-match was made on only part of the label: the object label carries more information than the subject label, so the match should not be exact.

Note: I think I have already reported this for the exact match based on label (for the other "exact" mapping file). I couldn't find that issue, but feel free to merge this one with it (or close this one if it has already been taken care of).

matentzn commented 1 year ago

How often does this happen? The mapping rule generating this wrong suggestion is quite significant (it removes things in brackets from a label, which are very often super noisy). I would opt for documenting this in an SOP for human curators, rather than removing the mapping rule that causes this to happen, but it's your choice.

PS can we tag all mapping related issues with a common tag?

sabrinatoro commented 1 year ago

We need to check the files we download: the labels might be "clipped" in the files we are getting.

matentzn commented 1 year ago

Don't confuse this with the other issue, #206. Here the removal of the brackets is deliberate, part of our mapping rules. In #206 it was unintended. So the only question you need to answer here is: how often did this decision (to clip stuff in brackets) cause wrong suggestions?
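For readers following along, the effect of a "clip stuff in brackets" rule can be illustrated with a minimal sketch. This is not the actual oaklib rule; the function name `clip_brackets` and the regular expression are assumptions made only to show why the two labels from the reported row end up identical.

```python
import re

# Hypothetical stand-in for a "drop bracketed text" normalization rule.
# It is NOT the oaklib implementation, only an illustration of the effect.
_BRACKETED = re.compile(r"\s*\([^()]*\)")


def clip_brackets(label: str) -> str:
    """Lower-case a label and remove any parenthesized segments."""
    clipped = _BRACKETED.sub("", label)
    # Collapse whitespace left behind by the removal.
    return " ".join(clipped.lower().split())


subject_label = "polycythemia"
object_label = "Polycythemia (Excluding Polycythemia Vera)"

# Both labels normalize to the same string, hence the lexical match.
print(clip_brackets(subject_label) == clip_brackets(object_label))  # True
```

Under this rule the qualifier "Excluding Polycythemia Vera" never reaches the comparison, which is why the suggestion looks exact even though the NCIT term is narrower.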

sabrinatoro commented 1 year ago

What is in brackets most often adds more specificity to the term (e.g. "unspecified", "except for...", "not included elsewhere", ...). Therefore, in most cases, the mapping will not be exact, and the brackets shouldn't be removed when making the mapping. That being said, I cannot consistently find examples of mappings where the brackets are included. So I cannot tell whether they are removed by default (and what I am finding is a "bug") or whether they are not removed (and I should be expecting more of these).

matentzn commented 1 year ago

Option 1) A potential solution to this problem is, for NCIT, to implement a preprocessing step that strips the parentheses out, thereby making what is in brackets part of the proper label. I think this is better than dropping the mapping rule altogether, because in cases like ICD we have truly irrelevant information in the labels.

Option 2) We could also simply delete "truly irrelevant" pieces of information in ingests like ICD10 and decide that information in brackets is more often informative than it is misleading.
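A rough sketch of what the Option 1 preprocessing could look like, assuming it runs on the NCIT labels before lexical matching. The function name `unwrap_brackets` is hypothetical, and where such a step would live in the ingest pipeline is not specified in this thread.

```python
import re


def unwrap_brackets(label: str) -> str:
    """Remove the parenthesis characters but keep the text they enclosed."""
    unwrapped = re.sub(r"[()]", " ", label)
    # Normalize case and whitespace so the result is comparable across labels.
    return " ".join(unwrapped.lower().split())


print(unwrap_brackets("Polycythemia (Excluding Polycythemia Vera)"))
# polycythemia excluding polycythemia vera
```

With the qualifier kept as part of the label, a later bracket-clipping rule has nothing to clip, so "polycythemia" would no longer match this NCIT label exactly.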

@sabrinatoro it boils down to this question: are brackets more often important or more often irrelevant when establishing an equivalent mapping?
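One way to put a number on that question is to count the lexmatch rows whose labels contain bracketed text that is missing from `match_string`. The sketch below assumes a TSV with the `subject_label`, `object_label` and `match_string` columns shown in the table above. The helper names are hypothetical, the first sample row is the one from this issue, and the second sample row is invented purely for illustration.

```python
import csv
import io
import re

BRACKETED = re.compile(r"\(([^()]*)\)")


def bracket_dependent(row: dict) -> bool:
    """True if either label has bracketed text that is absent from match_string."""
    match_string = row["match_string"].lower()
    for field in ("subject_label", "object_label"):
        for inner in BRACKETED.findall(row[field]):
            if inner.strip().lower() not in match_string:
                return True
    return False


def summarize(tsv_text: str) -> tuple:
    """Return (number of bracket-dependent rows, total rows) for a TSV string."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    flagged = [row for row in rows if bracket_dependent(row)]
    return len(flagged), len(rows)


header = ["subject_label", "object_label", "match_string"]
sample_rows = [
    # Row reported in this issue.
    ["polycythemia", "Polycythemia (Excluding Polycythemia Vera)", "polycythemia"],
    # Invented row with no brackets, for contrast only.
    ["example disease", "Example Disease", "example disease"],
]
sample_tsv = "\n".join("\t".join(cells) for cells in [header, *sample_rows])

flagged, total = summarize(sample_tsv)
print(f"{flagged} of {total} matches depend on clipped bracket content")
```

The count only says how often clipping happened, not whether the clipped text mattered, so the flagged rows would still need a curator's review.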