medkit-lib / medkit

Toolkit for a learning health system
https://medkit-lib.org/
MIT License

Simstring matcher could produce span with corrected term #38

Open nourG22 opened 5 months ago

nourG22 commented 5 months ago

Issue

When the SimString matcher is used for typo-tolerant matching, it correctly identifies misspelled words such as "mxnidipine", but it has no way to produce the corrected form of the word, such as "manidipine".

Reproduction Steps

  1. Provide input text containing a misspelled word, such as "mxnidipine".
  2. Run the SimString matcher on that text; the word is matched despite the typo.

Current Behavior

SimString accurately identifies misspelled terms but does not provide the corrected version.
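
To make the behavior concrete, here is a minimal standalone sketch of approximate matching by character trigram cosine similarity, the kind of measure SimString-style matching relies on. It does not use medkit or the simstring library: `char_ngrams`, `cosine`, `find_matches` and the 0.7 threshold are hypothetical names and values chosen for illustration only. The point is that the matcher already knows which dictionary term triggered the match, even though the resulting span only carries the surface text.

```python
import re
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Return the multiset of character n-grams of a padded word."""
    pad = "$" * (n - 1)
    padded = f"{pad}{word}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between the n-gram multisets of two words."""
    fa, fb = char_ngrams(a), char_ngrams(b)
    shared = sum((fa & fb).values())
    return shared / sqrt(sum(fa.values()) * sum(fb.values()))


def find_matches(text, terms, threshold=0.7):
    """Match each token against the dictionary terms (illustrative only)."""
    matches = []
    for token in re.finditer(r"\w+", text):
        best = max(terms, key=lambda term: cosine(token.group(), term))
        score = cosine(token.group(), best)
        if score >= threshold:
            matches.append({
                "text": token.group(),   # surface form, typo included
                "start": token.start(),
                "end": token.end(),
                "matched_term": best,    # dictionary term behind the match
                "score": score,
            })
    return matches


matches = find_matches("patient takes mxnidipine daily", ["manidipine"])
```

In this sketch the `matched_term` field is exactly the information the issue asks for: the span text stays "mxnidipine", while the dictionary entry "manidipine" is available at match time.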

Cause

The SimString matching logic currently lives in a parent class, and the matcher has no specialized run() method of its own, which prevents it from generating corrected terms.

Suggested Solution

Implement a run() specialization within the SimString matcher class to enable the generation of corrected terms. This specialization should extend to both SimString matcher and regular expression matcher functionalities.

ghisvail commented 5 months ago

Thanks @nourG22 for the very detailed reporting.

I have discussed this issue with the rest of the team, and we were thinking of an alternative solution which would provide more flexibility. Your report was very useful to kickstart the discussion with a realistic use case.

The proposed alternative is the following: instead of replacing the text within the span produced by the SimstringMatcher, the span could be enriched with an additional normalization (the name is up for discussion; let's call it "typo correction" for now). That normalization could then be applied to the text in a follow-up operation (also up for discussion), or simply carried along through the rest of the processing.

This way, users can still choose to keep the typo, correct it, or use an alternative normalization (e.g. UMLS), while the information about the match is carried along.
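
A rough sketch of what this proposal could look like, written with plain dataclasses rather than medkit's actual classes. Everything here is an assumption made for illustration: `NormAttribute`, `Entity`, `apply_typo_corrections` and the `"typo_correction"` kind are hypothetical names, not the medkit API, and the UMLS value is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class NormAttribute:
    """A normalization attached to an entity (hypothetical structure)."""
    kind: str   # e.g. "typo_correction" or "umls"
    value: str


@dataclass
class Entity:
    """A matched span that keeps the original surface text."""
    label: str
    text: str
    start: int
    end: int
    norms: list = field(default_factory=list)


def apply_typo_corrections(text, entities, kind="typo_correction"):
    """Optional follow-up step: substitute corrected terms into the text."""
    corrected = text
    # Replace from right to left so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        fix = next((n.value for n in ent.norms if n.kind == kind), None)
        if fix is not None:
            corrected = corrected[:ent.start] + fix + corrected[ent.end:]
    return corrected


text = "patient takes mxnidipine daily"
entity = Entity(
    label="drug",
    text="mxnidipine",
    start=14,
    end=24,
    norms=[
        NormAttribute("typo_correction", "manidipine"),
        NormAttribute("umls", "CUI-PLACEHOLDER"),  # placeholder, not a real CUI
    ],
)
corrected_text = apply_typo_corrections(text, [entity])
```

The entity itself is never modified: its text remains "mxnidipine", and the correction is only applied if the user runs the follow-up step. Other normalizations can sit alongside the typo correction on the same entity.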