languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

multi word spell checking performance #2961

Open ghost opened 4 years ago

ghost commented 4 years ago

Unfortunately, the number of multi-word entries that can be added to spelling.txt is limited for performance reasons. But spelling warnings are the most frequently reported LT message, and they often concern proper names of companies, sports clubs, etc. These are combinations for which one would not want the individual words accepted by the spell checker, since only the combination is correct. So I think that when the current single-word spell checker finds an error, it should retry with a multi-word lookup. That could largely reuse the same code, applied to multiple tokens. In fact, I think spell checking could be extended to longer ranges of tokens.
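To make the proposal concrete, here is a minimal Python sketch of the retry idea. It is not LanguageTool code (LT is written in Java); the function names, the `max_len` window, and the word and phrase sets are all hypothetical. The phrase lookup only runs for tokens the single-word check rejects, so correctly spelled text pays no extra cost.

```python
def find_phrase_around(tokens, i, known_phrases, max_len):
    """Return (start, end) of a known phrase that contains token i, or None.

    Longer phrases are tried first. known_phrases is a set of token tuples.
    """
    for n in range(max_len, 1, -1):
        first = max(0, i - n + 1)          # earliest window that still contains i
        last = min(i, len(tokens) - n)     # latest window that fits in the text
        for start in range(first, last + 1):
            if tuple(tokens[start:start + n]) in known_phrases:
                return start, start + n
    return None


def flag_spelling_errors(tokens, known_words, known_phrases, max_len=4):
    """Return indices of tokens that are still misspelled after the phrase retry."""
    flagged = []
    covered = set()                        # indices inside an accepted phrase
    for i, tok in enumerate(tokens):
        if tok in known_words or i in covered:
            continue
        # The single-word check failed: retry on multi-word windows around i.
        match = find_phrase_around(tokens, i, known_phrases, max_len)
        if match is None:
            flagged.append(i)
        else:
            covered.update(range(*match))
    return flagged


# Hypothetical data: "Borussia" alone is not accepted, only the full club name.
words = {"I", "support", "and", "today"}
phrases = {("Borussia", "Dortmund")}
text = "I support Borussia Dortmund and Borussia today".split()
print(flag_spelling_errors(text, words, phrases))  # [5]
```

Only the second, standalone "Borussia" is flagged, which is the behaviour asked for above: the combination is accepted without whitelisting its parts.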

danielnaber commented 4 years ago

Idea: use AhoCorasickDoubleArrayTrie instead of the disambiguator to find the token ranges to be ignored (note to self: branch faster-spelling-for-phrases).
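For readers unfamiliar with the technique, the sketch below shows how an Aho-Corasick automaton can find every phrase range in one pass over the tokens. It is a plain dictionary-based automaton in Python, not the double-array trie variant and not the API of the AhoCorasickDoubleArrayTrie library; the class and method names are hypothetical, and it matches on whole tokens rather than characters, which is an assumption about how the ranges would be looked up.

```python
from collections import deque


class PhraseMatcher:
    """Aho-Corasick automaton whose alphabet is tokens, not characters."""

    def __init__(self, phrases):
        self.goto = [{}]   # state -> {token: next state}
        self.fail = [0]    # state -> state for the longest proper suffix
        self.out = [[]]    # state -> lengths of phrases ending at this state
        for phrase in phrases:
            state = 0
            for tok in phrase:
                nxt = self.goto[state].get(tok)
                if nxt is None:
                    nxt = len(self.goto)
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][tok] = nxt
                state = nxt
            self.out[state].append(len(phrase))
        # Breadth-first pass: compute failure links and merge outputs.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for tok, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(tok, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def ranges(self, tokens):
        """Return (start, end) index pairs, end exclusive, for every phrase hit."""
        state = 0
        found = []
        for i, tok in enumerate(tokens):
            while state and tok not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(tok, 0)
            for length in self.out[state]:
                found.append((i - length + 1, i + 1))
        return found


matcher = PhraseMatcher([("New", "York"), ("New", "York", "Times"), ("York", "Times")])
tokens = ["the", "New", "York", "Times", "said"]
hits = matcher.ranges(tokens)
ignored = {i for start, end in hits for i in range(start, end)}
print(hits)             # [(1, 3), (1, 4), (2, 4)]
print(sorted(ignored))  # [1, 2, 3]
```

Matching cost grows with the number of tokens plus the number of hits, largely independent of how many phrases are loaded, which is why this kind of structure is attractive when the phrase list is large.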