Open twhetzel opened 1 month ago
I see now that in the ICD11 labels for these entries there is an extra space between at least two of the words (tokens), but it is unclear why these are not reported in the "lex_exact" file.
Interesting. There's a lot that I don't know about lexmatch, but I agree with you that whitespace stripping should happen before it runs. I suppose it's worth checking at some point to see if that's happening.
The code that separates exact matches from non-exact ones has nothing to do with lexmatch. You will have to find the Python script where the separation occurs, then the specific conditional that decides whether a match is exact or not, and then add some preprocessing there. Lexmatch itself can handle the whitespace!
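For illustration, the kind of preprocessing suggested above could look like the sketch below. This is not code from mondo-ingest; `normalize_label` and `labels_match_exactly` are hypothetical helper names, and the sketch assumes that "exact" means the two labels are equal once runs of whitespace are collapsed (case is deliberately left untouched).

```python
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_label(label: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", label).strip()


def labels_match_exactly(subject_label: str, object_label: str) -> bool:
    """Hypothetical exactness check: compare labels after whitespace cleanup.

    The comparison stays case-sensitive; only whitespace is normalized.
    """
    return normalize_label(subject_label) == normalize_label(object_label)
```

Wherever the real conditional lives, calling something like `normalize_label` on both labels before comparing them would keep a stray double space from changing the outcome.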
By "nothing to do with lexmatch", I'm guessing you mean "nothing to do with OAK's lexmatch functionality", as opposed to the "mondo-ingest lexmatch pipeline". Because I would find it hard to imagine the source of the issue would exist outside of the lexmatch pipeline.
As a cursory investigation, I see:
lexmatch-sssom-compare.py, in export_unmatched_exact():
    unmapped_exact = unmapped_df[
        (unmapped_df["comment"] == match_type)
        & (unmapped_df["predicate_id"] == "skos:exactMatch")
but I don't see where the non-exact file is exported. I'll stop for now but @twhetzel Let me know if you want me to take over looking into this.
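One way to narrow this down might be to list the mapping rows whose labels differ as written but agree once whitespace is collapsed. The sketch below is an assumption-laden diagnostic, not part of the pipeline: it works on plain dictionaries rather than the script's DataFrame, `whitespace_only_mismatches` is a hypothetical name, and it assumes the rows carry the SSSOM `subject_label` and `object_label` columns.

```python
import re


def collapse_whitespace(text):
    """Collapse whitespace runs to one space; treat a missing value as empty."""
    return re.sub(r"\s+", " ", text or "").strip()


def whitespace_only_mismatches(rows):
    """Return rows whose labels differ only in whitespace.

    `rows` is an iterable of dicts keyed by SSSOM column names
    (assumed here: "subject_label" and "object_label").
    """
    flagged = []
    for row in rows:
        subject = row.get("subject_label") or ""
        obj = row.get("object_label") or ""
        # Different as written, identical after whitespace normalization.
        if subject != obj and collapse_whitespace(subject) == collapse_whitespace(obj):
            flagged.append(row)
    return flagged
```

Running this over the rows read from the "lex" file (for example via `csv.DictReader` with a tab delimiter, skipping any leading metadata comment lines) should show whether the four entries in question are the only ones affected.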
@joeflack4 - yes, if you can sort out this bug that would be good.
@hrshdhgd Re-assigning this to you now that you're back! (WB!)
You may notice that in "Development" on the right, this is connected to a PR that has been merged. That PR ended up being unrelated, but it looks like I can't unlink it now.
@joeflack4 I won't be able to get to this since my priorities have changed. I'm punting this back to your plate.
I was reviewing the ICD11 "lex" file and there are 4 exact matches where the subject and object match on their respective labels. Is this a bug in the lexical alignment, an issue with running the pipeline, or something else?
The exact matches are for these Mondo terms:
MONDO:0003595
MONDO:0012089
MONDO:0021495
MONDO:0800453