monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

A few exact matches are in the `unmapped_icd11foundation_lex.tsv` file for ICD11 #523

Open twhetzel opened 1 month ago

twhetzel commented 1 month ago

I was reviewing the ICD11 "lex" file and found 4 exact matches where the subject and object match on their respective labels. Is this a bug in the lexical alignment, an issue with running the pipeline, or something else?

The exact matches are for these Mondo terms:

twhetzel commented 1 month ago

I see now that in the ICD11 labels for these entries there is an extra space between at least one pair of words (tokens), but it is unclear why these are not reported in the "lex_exact" file.

joeflack4 commented 1 month ago

Interesting. There's a lot that I don't know about lexmatch, but I agree with you that whitespace stripping should happen before it runs. I suppose it's worth checking at some point to see if that's happening.

matentzn commented 1 month ago

The code that separates exact matches from non-exact ones has nothing to do with lexmatch. You will have to find the Python script where the separation occurs, then the specific conditional that decides whether a match is exact or not, and then add some preprocessing there. Lexmatch itself can handle the whitespace!
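The preprocessing suggested here could look roughly like the following. This is only a sketch: `normalize_label` and `labels_match_exactly` are hypothetical helper names, not functions from mondo-ingest or OAK, and it assumes the only difference between the labels is irregular whitespace.

```python
import re


def normalize_label(label: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", label).strip()


def labels_match_exactly(subject_label: str, object_label: str) -> bool:
    """Compare two labels after whitespace normalization (hypothetical helper)."""
    return normalize_label(subject_label) == normalize_label(object_label)
```

With something like this inside the exact/non-exact conditional, a label with a doubled space between two tokens would compare equal to its single-spaced counterpart, while labels that differ in any other way would still be treated as non-exact.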

joeflack4 commented 1 month ago

By "nothing to do with lexmatch", I'm guessing you mean "nothing to do with OAK's lexmatch functionality", as opposed to the "mondo-ingest lexmatch pipeline". Because I would find it hard to imagine the source of the issue would exist outside of the lexmatch pipeline.

As a cursory investigation, I see: lexmatch-sssom-compare.py, in export_unmatched_exact():

    unmapped_exact = unmapped_df[
        (unmapped_df["comment"] == match_type)
        & (unmapped_df["predicate_id"] == "skos:exactMatch")

but I don't see where the non-exact file is exported. I'll stop for now, but @twhetzel, let me know if you want me to take over looking into this.
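For reference, here is a rough sketch of how a DataFrame-level split could tolerate extra whitespace in labels. It is not the code in lexmatch-sssom-compare.py: the function name `split_exact_by_label` is hypothetical, and it assumes the frame carries the SSSOM columns `subject_label`, `object_label`, and `predicate_id`.

```python
import pandas as pd


def split_exact_by_label(unmapped_df: pd.DataFrame):
    """Return (exact, non_exact) frames, comparing labels after whitespace cleanup.

    Hypothetical helper; column names are assumed to follow SSSOM.
    """

    def clean(column: str) -> pd.Series:
        # Collapse runs of whitespace to one space and trim the ends.
        return (
            unmapped_df[column]
            .fillna("")
            .str.replace(r"\s+", " ", regex=True)
            .str.strip()
        )

    subject = clean("subject_label")
    obj = clean("object_label")
    is_exact = (
        (unmapped_df["predicate_id"] == "skos:exactMatch")
        & (subject == obj)
        & (subject != "")  # two missing labels should not count as a match
    )
    return unmapped_df[is_exact], unmapped_df[~is_exact]
```

Returning both frames from one mask keeps the exact and non-exact outputs complementary, so a row cannot land in both files or in neither.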

twhetzel commented 1 month ago

@joeflack4 - yes, if you can sort out this bug that would be good.

joeflack4 commented 2 weeks ago

@hrshdhgd Re-assigning this to you now that you're back! (WB!)

You may notice that in "Development" on the right, this is connected to a PR that has been merged. That PR ended up being unrelated, but it looks like I can't unlink it now.

hrshdhgd commented 1 week ago

@joeflack4 I won't be able to get to this since my priorities have changed. I'm punting this back to your plate.