Open nourG22 opened 5 months ago
Thanks @nourG22 for the very detailed reporting.
I have discussed this issue with the rest of the team, and we were thinking of an alternative solution which would provide more flexibility. Your report was very useful to kickstart the discussion with a realistic use case.
The proposed alternative is the following: instead of replacing the text within the span produced by the SimstringMatcher, the span could be enriched with an additional normalization (the name is up for discussion; let's call it "typo correction" for now). That normalization could then be applied to the text in a follow-up operation (also up for discussion) or simply carried along through the rest of the processing.
This way, users can still choose to keep the typo or correct it, or to use an alternative normalization (e.g. UMLS), while the information about the match stays attached to the span.
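To make the proposal more concrete, here is a minimal sketch of a span that carries a typo-correction normalization, with the text rewrite as a separate, optional step. The `Span` and `Normalization` classes, the `"typo_correction"` label, and `apply_corrections()` are hypothetical names used for illustration only; they are not the library's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Normalization:
    # Hypothetical attribute: kind could be "typo_correction", "umls", ...
    kind: str
    value: str
    score: float = 1.0


@dataclass
class Span:
    start: int
    end: int
    text: str  # original text, typo included; never overwritten
    normalizations: List[Normalization] = field(default_factory=list)


def apply_corrections(text: str, spans: List[Span], kind: str = "typo_correction") -> str:
    """Optional follow-up step: return a copy of text with corrections applied."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        corrected: Optional[str] = next(
            (n.value for n in span.normalizations if n.kind == kind), None
        )
        if corrected is None:
            continue  # no correction attached: leave the original text alone
        out.append(text[cursor:span.start])
        out.append(corrected)
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Patient on mxnidipine 10 mg."
span = Span(start=11, end=21, text="mxnidipine")
span.normalizations.append(Normalization(kind="typo_correction", value="manidipine", score=0.9))

corrected_text = apply_corrections(text, [span])
```

In this sketch the span keeps its original text and offsets, so downstream steps can still see what was actually written, and the corrected string is only produced when a caller asks for it.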
Issue
When using SimString for typo detection, it accurately identifies misspelled words like "mxnidipine" but lacks functionality to produce the corrected version of the word, such as "manidipine".
Reproduction Steps
Current Behavior
SimString accurately identifies misspelled terms but does not provide the corrected version.
Cause
The current logic of SimString is constrained within a parent class and lacks a specialized run() method, which hinders the generation of corrected terms.
Suggested Solution
Implement a run() specialization within the SimString matcher class to enable the generation of corrected terms. This specialization should cover both the SimString matcher and the regular expression matcher.
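As a rough illustration of what such a specialization could look like, the sketch below overrides run() in a subclass so that each match also reports the vocabulary term it matched. The class names and the returned dictionaries are hypothetical, and the standard library's `difflib` stands in for SimString here, so this only shows the shape of the idea rather than the library's implementation.

```python
import difflib
from typing import Dict, List, Optional


class BaseMatcher:
    """Hypothetical parent class: detects fuzzy matches but reports no correction."""

    def __init__(self, vocabulary: List[str], cutoff: float = 0.8):
        self.vocabulary = [term.lower() for term in vocabulary]
        self.cutoff = cutoff

    def find(self, token: str) -> Optional[str]:
        # difflib is used here as a stand-in for SimString's similarity search
        matches = difflib.get_close_matches(
            token.lower(), self.vocabulary, n=1, cutoff=self.cutoff
        )
        return matches[0] if matches else None

    def run(self, tokens: List[str]) -> List[Dict[str, str]]:
        return [{"text": t} for t in tokens if self.find(t) is not None]


class CorrectingMatcher(BaseMatcher):
    """Specialized run(): also returns the matched term as the corrected form."""

    def run(self, tokens: List[str]) -> List[Dict[str, str]]:
        results = []
        for token in tokens:
            term = self.find(token)
            if term is not None:
                results.append({"text": token, "corrected": term})
        return results


matcher = CorrectingMatcher(["manidipine", "amlodipine"])
results = matcher.run(["Patient", "on", "mxnidipine"])
```

The same pattern would apply to a regular expression matcher: the parent class keeps the detection logic, and the specialized run() decides what extra information to return alongside each match.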