Simstring matcher could produce span with corrected term

medkit-lib / medkit

Toolkit for a learning health system

MIT License


Issue

When using SimString for typo detection, it accurately identifies misspelled words like "mxnidipine" but lacks functionality to produce the corrected version of the word, such as "manidipine".

Reproduction Steps

Provide input text containing a misspelled word, such as "mxnidipine".
Run the SimString matcher on the text; it still detects the term despite the typo.

Current Behavior

SimString accurately identifies misspelled terms but does not provide the corrected version.
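The gap described above can be illustrated with a minimal standard-library sketch. It does not use medkit or SimString: `difflib` stands in for the approximate matcher, and the `TERMS` dictionary and the `detect` helper are hypothetical names introduced only for this example. The returned span carries the text as written, typo included, even though the matching dictionary term was known at match time.

```python
import difflib
import re

# Hypothetical dictionary of known terms (assumption, not from medkit).
TERMS = ["manidipine", "amlodipine"]


def detect(text, cutoff=0.8):
    """Return (start, end, surface_text) for each word close to a known term."""
    spans = []
    for m in re.finditer(r"\w+", text):
        # The closest term is found here, but only the surface text is kept.
        if difflib.get_close_matches(m.group().lower(), TERMS, n=1, cutoff=cutoff):
            spans.append((m.start(), m.end(), m.group()))
    return spans


spans = detect("Patient takes mxnidipine daily.")
```

The misspelled word is located correctly, but nothing in the result says which term it was matched against.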

Cause

The SimString matching logic currently lives in a parent class, and the SimString matcher has no specialized run() method of its own, so it cannot generate corrected terms.

Suggested Solution

Implement a specialized run() method in the SimString matcher class so that it can generate corrected terms. The same specialization should be provided for both the SimString matcher and the regular expression matcher.

Thanks @nourG22 for the very detailed reporting.

I have discussed this issue with the rest of the team, and we were thinking of an alternative solution which would provide more flexibility. Your report was very useful to kickstart the discussion with a realistic use case.

The proposed alternative is the following: instead of replacing the text within the span produced by the SimstringMatcher, the span could carry an additional normalization (the name is up for discussion; let's call it typo correction for now). That normalization could then be applied to the text in a follow-up operation (also up for discussion), or simply be carried along through the rest of the processing.

This way, users can still choose to keep the typo or correct it, or use an alternative normalization (e.g. UMLS), while the information about the match is carried along.
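A rough sketch of this idea, using only the Python standard library and not medkit's actual classes: `Match`, `Normalization`, `match_terms`, `apply_corrections` and the `"typo_correction"` kind are all hypothetical names chosen for illustration, and `difflib` stands in for SimString. The match keeps the original text and offsets; the correction is attached as a normalization and is only applied if the user asks for it.

```python
import difflib
import re
from dataclasses import dataclass, field


@dataclass
class Normalization:
    kind: str   # e.g. "typo_correction" (hypothetical name) or "umls"
    value: str


@dataclass
class Match:
    start: int
    end: int
    text: str   # text as written in the document, typo included
    normalizations: list = field(default_factory=list)


def match_terms(text, terms, cutoff=0.8):
    """Fuzzy-match words against terms; attach the matched term as a normalization."""
    matches = []
    for m in re.finditer(r"\w+", text):
        word = m.group()
        close = difflib.get_close_matches(word.lower(), terms, n=1, cutoff=cutoff)
        if not close:
            continue
        match = Match(m.start(), m.end(), word)
        if close[0] != word.lower():  # only record a correction for inexact hits
            match.normalizations.append(Normalization("typo_correction", close[0]))
        matches.append(match)
    return matches


def apply_corrections(text, matches):
    """Optional follow-up step: rewrite the text using the attached corrections."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for match in sorted(matches, key=lambda x: x.start, reverse=True):
        for norm in match.normalizations:
            if norm.kind == "typo_correction":
                text = text[:match.start] + norm.value + text[match.end:]
                break
    return text


text = "Patient takes mxnidipine daily."
matches = match_terms(text, ["manidipine"])
corrected = apply_corrections(text, matches)
```

Because the correction is an attribute rather than a rewrite, `text` and the span offsets stay untouched unless `apply_corrections` is called, and other normalizations could sit alongside it in the same list.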
