morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Speller: words containing the dictionary separator are not handled properly #86

Closed: jaumeortola closed this issue 7 years ago

jaumeortola commented 7 years ago

We found this bug in LanguageTool using the British English dictionary. See: https://github.com/languagetool-org/languagetool/issues/619

The dictionary has this structure: <word form><separator><byte containing frequency information A..Z>
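To make the collision concrete, here is a minimal sketch of that entry layout. It assumes the separator is "_" (as the "eta_I" example suggests) and that the frequency byte is a class 0..25 mapped to 'A'..'Z'; the class and method names are hypothetical, not Morfologik API, and the frequency value for "eta" is made up.

```java
// Illustrative only: EntryLayout and its methods are hypothetical names,
// not part of the Morfologik API.
final class EntryLayout {

    // Builds "<word form><separator><frequency byte>".
    // Assumption: the frequency class 0..25 is mapped to 'A'..'Z'.
    static String entry(String wordForm, char separator, int frequencyClass) {
        if (frequencyClass < 0 || frequencyClass > 25) {
            throw new IllegalArgumentException("frequency class must be in 0..25");
        }
        return wordForm + separator + (char) ('A' + frequencyClass);
    }

    // True if the looked-up word itself contains the separator character.
    static boolean containsSeparator(String word, char separator) {
        return word.indexOf(separator) >= 0;
    }

    public static void main(String[] args) {
        // A made-up entry: word form "eta" with frequency class 8 ('I').
        String stored = entry("eta", '_', 8);
        System.out.println(stored);                          // eta_I
        // The looked-up word "eta_I" has the same bytes as that stored entry.
        System.out.println(containsSeparator("eta_I", '_')); // true
    }
}
```

Under these assumptions, an input word such as "eta_I" is byte-for-byte identical to a stored entry rather than to a word form, which is why it can confuse the lookup.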

When a word like "eta_I" is looked up in the speller, the speller stops working for all subsequent words. I have written a test here.

The problem is clearly in the method isInDictionary(). Once containsSeparators is set to false, it is never reset to true, so isInDictionary() returns false for all subsequent words.

An obvious solution is to check whether the original word contains the separator and, if so, return false from isInDictionary() before searching at all, because such a word cannot possibly be in the dictionary.
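The stale flag and the proposed early return can be sketched with a toy model. This is not the actual Morfologik Speller code: ToySpeller, its set-based lookup, and the way the flag gets cleared are all assumptions made for illustration. Only the names containsSeparators and isInDictionary come from the report above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the reported behaviour; NOT the real Morfologik Speller.
final class ToySpeller {
    private final Set<String> words;
    private final char separator;

    // Instance state that survives between lookups. In this sketch nothing
    // ever sets it back to true once it has been cleared.
    private boolean containsSeparators = true;

    ToySpeller(char separator, String... words) {
        this.separator = separator;
        this.words = new HashSet<>(Arrays.asList(words));
    }

    // Buggy variant (assumed shape): one lookup of a word that contains the
    // separator clears the flag, and every later lookup then fails.
    boolean isInDictionaryBuggy(String word) {
        if (word.indexOf(separator) >= 0) {
            containsSeparators = false;
        }
        return containsSeparators && words.contains(word);
    }

    // Proposed fix: reject words containing the separator up front and keep
    // no state between calls.
    boolean isInDictionary(String word) {
        if (word.indexOf(separator) >= 0) {
            return false;
        }
        return words.contains(word);
    }

    public static void main(String[] args) {
        ToySpeller buggy = new ToySpeller('_', "eta", "word");
        System.out.println(buggy.isInDictionaryBuggy("word"));  // true
        System.out.println(buggy.isInDictionaryBuggy("eta_I")); // false
        System.out.println(buggy.isInDictionaryBuggy("word"));  // false (flag is stale)

        ToySpeller fixed = new ToySpeller('_', "eta", "word");
        System.out.println(fixed.isInDictionary("eta_I"));      // false
        System.out.println(fixed.isInDictionary("word"));       // true
    }
}
```

The early return keeps isInDictionary() free of side effects, so one unusual input cannot change the result for later words.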

Anyway, I don't understand the logic behind the variable containsSeparators, which should be reset to true somewhere. @milekpl

jaumeortola commented 7 years ago

The bug doesn't happen with other separator characters, like "+". So perhaps the issue is related to https://github.com/morfologik/morfologik-stemming/issues/85.

dweiss commented 7 years ago

Thanks Jaume! Is there anything else coming or do you want me to publish a point release?

dweiss commented 7 years ago

Went ahead and released 2.1.3, all tests passed.