monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License
604 stars, 76 forks

Grounder doesn't handle non-ascii labels well #160

Open cmungall opened 1 year ago

cmungall commented 1 year ago

E.g. for the example in #159

...Sjögren's Syndrome...

  - id: AUTO:Sj%C3%B6gren%27s%20Syndrome
    label: Sjögren's Syndrome

fails to match on

name: Sjogren syndrome
synonym: "Sjogren syndrome" EXACT []
synonym: "Sjögren syndrome" EXACT []
synonym: "Sjögren-Gougerot syndrome" EXACT []
synonym: "primary Sjogren-Gougerot syndrome" EXACT []
synonym: "primary Sjögren-Gougerot syndrome" EXACT []
synonym: "sicca syndrome" EXACT []
synonym: "syndrome, Sjogren's" EXACT []
synonym: "xerodermosteosis" EXACT []
synonym: "Sjogren's syndrome" RELATED []
synonym: "primary Sjögren syndrome" RELATED []

caufieldjh commented 1 year ago

Looks like a case for unidecode - or some other pre-match processing to normalize strings.
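
A minimal sketch of that kind of pre-match normalization, using only the standard library's `unicodedata` rather than unidecode, and not taken from the ontogpt code base. The function names `normalize_label` and `ground` are hypothetical and chosen for illustration; the synonym list is copied from the issue above.

```python
import unicodedata


def normalize_label(label: str) -> str:
    """Strip diacritics and case differences from a label (hypothetical helper)."""
    # NFKD decomposes "ö" into "o" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", label)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold().strip()


def ground(label: str, candidates: list[str]) -> list[str]:
    """Return the candidates that equal the label after normalization (hypothetical helper)."""
    target = normalize_label(label)
    return [c for c in candidates if normalize_label(c) == target]


# Name and synonyms as listed in the issue.
candidates = [
    "Sjogren syndrome",
    "Sjögren syndrome",
    "Sjögren-Gougerot syndrome",
    "primary Sjogren-Gougerot syndrome",
    "primary Sjögren-Gougerot syndrome",
    "sicca syndrome",
    "syndrome, Sjogren's",
    "xerodermosteosis",
    "Sjogren's syndrome",
    "primary Sjögren syndrome",
]

matches = ground("Sjögren's Syndrome", candidates)
print(matches)
```

With this normalization the extracted label matches only the RELATED synonym "Sjogren's syndrome"; the EXACT synonyms still differ by the possessive "'s", so that would need separate handling. Decomposition also leaves characters without a canonical decomposition (for example "ø") unchanged, which is one reason a transliteration library such as unidecode might be preferred.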