languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Improve sorting of suggestions #732

Open danielnaber opened 7 years ago

danielnaber commented 7 years ago

Suggestions for "Hauk" are: "Hauch; Haus; Hank; Hauke; Haut; ..."

"Hauch" comes first because of REPL in GermanSpellerRule: "ch" and "k" are considered a confusion pair. The problem seems to be that we don't consider the occurrence data (inside the *.dict) after that, so "Hauch" ends up as the first suggestion even though it is less common than "Haus" (it also has a larger Levenshtein distance, but only before considering REPL).

Solution: clean up the whole suggestion code. First find candidates, then properly sort them using the ngram data. We shouldn't rely on de_wordlist.xml at all because it seems to be inconsistent. For example, it contains "haut" but not "Haut", and "rennen" but not "Rennen", so it seems to be only partially case-sensitive. With the ngram data, we have a larger corpus and can also consider the context.
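The second step (sorting candidates that were already found) could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, the counts are made up, and it uses plain per-word counts, so the context that real ngram data would provide is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class SuggestionSorter {

    // Hypothetical helper, not part of LanguageTool: orders candidates by
    // corpus count, most frequent first. Words without a count are treated
    // as count 0. List.sort is stable, so ties keep their original order.
    static List<String> sortByFrequency(List<String> candidates, Map<String, Long> counts) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong((String w) -> counts.getOrDefault(w, 0L)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Long> counts = Map.of("Haus", 900L, "Haut", 300L, "Hauch", 50L);
        List<String> candidates = List.of("Hauch", "Haus", "Hank", "Hauke", "Haut");
        System.out.println(sortByFrequency(candidates, counts));
    }
}
```

With these invented counts, "Haus" moves ahead of "Hauch". A context-aware version would presumably look up counts for the candidate together with its neighboring words instead of the candidate alone.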

janschreiber commented 7 years ago

> With the ngram data, we have a larger corpus and can also consider the context.

This sounds extremely promising! A suggestion mechanism that takes into account both word frequency and direct neighbors is very likely to produce good results.

janschreiber commented 7 years ago

Two examples that show that something has to be done here: 'Mindeshöhe', 'Jezt'. The best suggestions are so low in the list that they are not even shown to users in the UI. Both could be solved by preferring words that are in the larger binary dictionary, or by preferring words that start with the same character. I think the latter is often done by spell checkers, because users get the first letter right most of the time.
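A minimal sketch of the "same first character" preference, assuming the suggestions are already available as a list. The class name, method name, and the example suggestion list are hypothetical, and comparing the first letter case-insensitively is a choice made here, not something taken from the existing code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FirstLetterPreference {

    // Hypothetical helper: moves suggestions that share the first letter of
    // the misspelled word to the front. The sort is stable, so the existing
    // order is kept within each of the two groups.
    static List<String> preferSameFirstLetter(String misspelled, List<String> suggestions) {
        List<String> sorted = new ArrayList<>(suggestions);
        if (misspelled.isEmpty()) {
            return sorted;
        }
        char first = Character.toLowerCase(misspelled.charAt(0));
        // Key is false for a matching first letter; false sorts before true.
        sorted.sort(Comparator.comparing(
            (String s) -> s.isEmpty() || Character.toLowerCase(s.charAt(0)) != first));
        return sorted;
    }

    public static void main(String[] args) {
        // Invented suggestion list, for illustration only.
        System.out.println(preferSameFirstLetter("Jezt", List.of("Fest", "Jet", "Rest", "Jetzt")));
    }
}
```

Because only the first letter is compared, this works as a cheap re-ranking step on top of whatever order the speller produced.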

janschreiber commented 7 years ago

Before or after sorting, the list of suggestions should also be deduplicated. It looks strange when items appear twice. The duplicates are probably introduced by getAdditionalTopSuggestions().
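One simple way to deduplicate while keeping the ranking is to pass the list through a LinkedHashSet, which keeps the first occurrence of each item in insertion order. This is a sketch with hypothetical names, not the code that is actually in the rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class SuggestionDedup {

    // Hypothetical helper: removes repeated suggestions and keeps the
    // position of the first occurrence, so the ranking is not disturbed.
    static List<String> deduplicate(List<String> suggestions) {
        return new ArrayList<>(new LinkedHashSet<>(suggestions));
    }

    public static void main(String[] args) {
        System.out.println(deduplicate(List.of("Haus", "Haut", "Haus", "Hauch", "Haut")));
    }
}
```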

danielnaber commented 7 years ago

Duplicate filtering should now be fixed.

danielnaber commented 7 years ago

Actually, the word frequency is already used directly by morfologik in Speller.java:

this.distance = distance * FREQ_RANGES + FREQ_RANGES - getFrequency(word) - 1

If we also consider frequency information ourselves, including context, we'll need to set fsa.dict.frequency-included=false in de_DE.info.
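To see what the formula above does to the ordering, here is a small stand-alone sketch. It assumes FREQ_RANGES is 26 (one frequency class per letter 'A' to 'Z') and that getFrequency returns a class between 0 and FREQ_RANGES - 1, with higher meaning more frequent. Both are assumptions for this illustration, and the class and method names are hypothetical.

```java
final class CombinedDistance {

    // Assumption: 26 frequency classes, one per letter 'A'..'Z'.
    static final int FREQ_RANGES = 'Z' - 'A' + 1;

    // Same arithmetic as the line quoted from Speller.java, with the
    // frequency class passed in directly instead of looked up per word.
    static int combined(int editDistance, int frequencyClass) {
        return editDistance * FREQ_RANGES + FREQ_RANGES - frequencyClass - 1;
    }

    public static void main(String[] args) {
        System.out.println(combined(0, 0));   // distance 0, rarest class
        System.out.println(combined(1, 25));  // distance 1, most frequent class
        System.out.println(combined(1, 0));   // distance 1, rarest class
    }
}
```

Under these assumptions, a candidate at a smaller edit distance always gets a smaller combined value than one at a larger edit distance, whatever their frequencies are. Frequency only decides the order among candidates at the same distance.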