Open danielnaber opened 7 years ago
With the ngram data, we have a larger corpus and can also consider the context.
This sounds extremely promising! A suggestion mechanism that takes into account both word frequency and direct neighbors is very likely to produce good results.
Two examples that show that something has to be done here: 'Mindeshöhe', 'Jezt'. The best suggestions are so far down the list that they are not even shown to users in the UI. Both could be solved by preferring words that are in the larger binary dictionary, or by preferring words that start with the same character. I think the latter is often done by spell checkers, because users get the first letter right most of the time.
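To make the "same first character" idea concrete, here is a minimal sketch. The class and method names are hypothetical and not part of LanguageTool, and the candidate list in `main` is made up for illustration. It only reorders an existing list, so it could run after whatever produces the candidates.

```java
import java.util.ArrayList;
import java.util.List;

final class FirstLetterPreference {

    // True if the candidate does NOT start with the given (lowercased) letter.
    private static boolean differs(String candidate, char first) {
        return candidate.isEmpty() || Character.toLowerCase(candidate.charAt(0)) != first;
    }

    // Hypothetical helper: moves candidates that share the misspelling's first
    // letter (ignoring case) to the front. List.sort is stable, so the original
    // relative order is kept within each group.
    static List<String> preferSameFirstLetter(String misspelling, List<String> candidates) {
        List<String> result = new ArrayList<>(candidates);
        if (misspelling.isEmpty()) {
            return result;
        }
        char first = Character.toLowerCase(misspelling.charAt(0));
        result.sort((a, b) -> Boolean.compare(differs(a, first), differs(b, first)));
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input only, not the real suggestion list for "Jezt".
        System.out.println(preferSameFirstLetter("Jezt", List.of("Fez", "Jet", "Jetzt")));
    }
}
```

This is only a tie-breaking heuristic; it would still need to be combined with distance and frequency.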
Before or after sorting, the list of suggestions should also be deduplicated. It looks strange when items appear twice. The duplicates are probably introduced by getAdditionalTopSuggestions().
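One simple way to deduplicate while keeping the first occurrence's position is a `LinkedHashSet`. This is a sketch with a hypothetical helper name, not necessarily how it is done in the rule itself:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class SuggestionDedup {

    // Hypothetical helper: removes repeated suggestions. LinkedHashSet keeps
    // insertion order, so each word stays at the position of its first occurrence.
    static List<String> dedupe(List<String> suggestions) {
        return new ArrayList<>(new LinkedHashSet<>(suggestions));
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("Haus", "Hauch", "Haus", "Haut")));
    }
}
```

Because order is preserved, it does not matter much whether this runs before or after sorting.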
Duplicate filtering should now be fixed.
Actually, the word frequency is already used directly by morfologik in Speller.java:
this.distance = distance * FREQ_RANGES + FREQ_RANGES - getFrequency(word) - 1
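To see what that formula does, here is a small standalone sketch. It assumes `FREQ_RANGES` is 26 and that `getFrequency` returns a value from 0 to 25, which is how I read the morfologik source; please check against the actual code. Under that assumption, the edit distance always dominates and the frequency only breaks ties between candidates at the same distance.

```java
final class CombinedDistance {

    // Assumption: mirrors morfologik's constant ('Z' - 'A' + 1); verify in Speller.java.
    static final int FREQ_RANGES = 26;

    // Same arithmetic as the line quoted above; lower results rank first.
    // 'frequency' is assumed to lie in 0..FREQ_RANGES-1, higher meaning more common.
    static int combined(int distance, int frequency) {
        return distance * FREQ_RANGES + FREQ_RANGES - frequency - 1;
    }

    public static void main(String[] args) {
        System.out.println(combined(1, 20)); // common word at distance 1
        System.out.println(combined(1, 0));  // rarest word at distance 1
        System.out.println(combined(2, 25)); // most common word at distance 2
    }
}
```

With these assumptions, a candidate at distance d always scores between 26·d and 26·d + 25, so even the rarest word at distance 1 still sorts ahead of the most common word at distance 2.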
In case we also consider frequency information, including context, we'll need to set fsa.dict.frequency-included=false in de_DE.info.
Suggestions for "Hauk" are: "Hauch; Haus; Hank; Hauke; Haut; ..."
The reason "Hauch" comes first is REPL in GermanSpellerRule: "ch" and "k" are considered a confusion pair. The problem seems to be that we don't consider the occurrence data (inside the *.dict) after that, so "Hauch" ends up as the first suggestion even though it is less common than "Haus" (it also has a larger Levenshtein distance, but only before considering REPL).
Solution: clean up the whole suggestion code: first find candidates, then properly sort them using the ngram data. We shouldn't rely at all on de_wordlist.xml because it seems to contain some ugly issues. For example, it contains "haut" but not "Haut", and "rennen" but not "Rennen", so it seems to be only partially case-sensitive. With the ngram data, we have a larger corpus and can also consider the context.
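A rough sketch of the "first find candidates, then sort" split, assuming the candidates are already collected and that some scoring function (unigram count, or an ngram probability given the neighboring words) is available. All names here are hypothetical, and the counts in `main` are invented placeholders, not real corpus numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

final class TwoStageSuggester {

    // Stage 2 only: sorts the candidates by descending score. The sort is
    // stable, so candidates with equal scores keep their original order.
    static List<String> rank(List<String> candidates, ToLongFunction<String> score) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingLong(score).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Placeholder counts for illustration; real values would come from the ngram index.
        Map<String, Long> counts = Map.of("Haus", 900L, "Haut", 300L, "Hauch", 50L);
        List<String> candidates = List.of("Hauch", "Haus", "Hank", "Hauke", "Haut");
        System.out.println(rank(candidates, w -> counts.getOrDefault(w, 0L)));
    }
}
```

Keeping candidate generation (Levenshtein, REPL pairs) separate from ranking means the scoring function can later be swapped for a context-aware one without touching the first stage.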