languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Improve and reduce spelling suggestions #3085

Open ghost opened 4 years ago

ghost commented 4 years ago

For shorter misspellings, the Morfologik speller returns too many suggestions, and many of them are far off. How can I improve this? The method I know would be to move the words to the ignore list and add a simple replace rule with the correct suggestions. Is that the preferred method?

danielnaber commented 4 years ago

Can you provide some examples?

ghost commented 4 years ago

wèl returns a lot of suggestions, but not the correct wel and wél. Likewise éen, which should just suggest een and één. -Ze does not suggest - Ze.

In general, the suggestion list rendered by the server contains too many suggestions for short words.

From 5 letters up, suggestions are a lot better.

danielnaber commented 4 years ago

Can you see whether this is limited to words with special characters? Also, does LibreOffice (which uses hunspell) have the same issue?

ghost commented 4 years ago

Hunspell does not have this. (Made the .aff myself)

danielnaber commented 4 years ago

wèl suggests a lot, except the correct wel and wél

It suggests wel for me, but wél is indeed missing. At first sight, this might even be a bug in Morfologik? @jaumeortola knows that code a bit, do you have an idea? nl_NL.info looks okay to me.

jaumeortola commented 4 years ago

I dumped the file spelling/nl_NL.dict, and "wél" is not there. When "wél" is added to nl/spelling/spelling.txt, it appears as the third suggestion.

2.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Er is een mogelijke spelfout gevonden.
Suggestion: Wel; wel; wél; Bel; El; Gel; Hel; Nel; Pel; Tel; Wal; Weg; Well; Wiel; Wil; bel; cel; del; el; fel; gel; hel; kwel; rel; tel; vel; wal; we; web; wee; weg; wei; welk; wen; wet; wiel; wil; wol; Mel; Wee; wed; wek; Wei; Welt; Wen; Wol; pel; Weel; Weil; Wes; iel; Wehl; weel; wel-; Wely; sel; welp; welt; Wels; lel; nel; Sel; Wael; awel; Welp; Kel; kel; zwel; Jel; Oel; wes; Weyl; woel; Wey; Zel; wep; yel; Owel; wem; Welz; welf; Yel
wèl 
^^^ 

And it is here: spelling/ignore.txt:wél.

ghost commented 4 years ago

Which proves my point, because wel and Wel should be the only suggestions.

danielnaber commented 4 years ago

Which proves my point, because wel and Wel should be the only suggestions.

We don't have logic yet to stop once good suggestions are found. The algorithm just keeps searching for more candidates. It's on the wishlist, though.

jaumeortola commented 4 years ago

I was confused because you said: "except the correct wel and wél". I understood "wél" was correct.

When the difference is only a diacritic mark and fsa.dict.speller.ignore-diacritics=true, then distance=0 and the word comes first in the suggestion list.
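To illustrate the ordering effect, here is a minimal sketch, not Morfologik's implementation: it strips diacritics with `java.text.Normalizer` and moves candidates that then match the typed word to the front. The class and method names (`DiacriticRanking`, `fold`, `rank`, `promote`) are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DiacriticRanking {
  // Decompose to NFD and drop combining marks, so accented letters
  // compare equal to their base letters.
  static String fold(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
  }

  // 0 when the candidate equals the typed word once diacritics are ignored, else 1.
  // This stands in for "distance=0"; it is an assumption, not the real distance code.
  static int rank(String typed, String candidate) {
    return fold(typed).equals(fold(candidate)) ? 0 : 1;
  }

  // List.sort is stable: rank-0 candidates move to the front,
  // all other candidates keep their relative order.
  static List<String> promote(String typed, List<String> candidates) {
    List<String> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingInt(c -> rank(typed, c)));
    return sorted;
  }
}
```

With wèl as the typed word and the candidates bel, wal, wel, wél, this moves wel and wél ahead of bel and wal.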

If you want to cut the list when the difference is only a diacritic mark, that is trivial. We could add that condition in MorfologikDutchSpellerRule.
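A rough sketch of what such a cut could look like, written as a standalone post-filter rather than as actual MorfologikDutchSpellerRule code. The class `DiacriticCut` and its methods are hypothetical, and ignoring case as well as diacritics is an extra assumption made here so that both Wel and wel survive.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class DiacriticCut {
  // Comparison key: diacritics removed, lowercased (the case folding is an assumption).
  static String key(String s) {
    String noMarks = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    return noMarks.toLowerCase(Locale.ROOT);
  }

  // Hypothetical post-filter: if some suggestions differ from the typed word
  // only by diacritics (or case), keep just those; otherwise keep the full list.
  static List<String> cut(String typed, List<String> suggestions) {
    String typedKey = key(typed);
    List<String> close = new ArrayList<>();
    for (String s : suggestions) {
      if (key(s).equals(typedKey)) {
        close.add(s);
      }
    }
    return close.isEmpty() ? suggestions : close;
  }
}
```

Applied to the output above, the list for wèl would shrink to Wel, wel and wél. A word with no diacritic-only match keeps its full list.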

Anyway, the long list is seen only by developers. Usually users only see 5 suggestions (depending on the user interface).

jaumeortola commented 4 years ago

In the current spelling configuration, it seems that even when there are many suggestions with distance=1, we keep searching for suggestions with distance=2. Often that seems unnecessary. Is that the desired behavior, @danielnaber?
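For illustration only, the alternative being asked about could look like the sketch below: collect candidates tier by tier and stop once a tier has produced enough of them. It uses a plain word list and a textbook Levenshtein distance, not Morfologik's FSA traversal. `TieredSuggester` and the `enough` threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TieredSuggester {
  // Plain Levenshtein distance using two rolling rows.
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Collect candidates at distance 1, then 2, ... up to maxDistance,
  // but stop as soon as at least `enough` candidates have been found.
  static List<String> suggest(String word, List<String> dictionary, int maxDistance, int enough) {
    List<String> result = new ArrayList<>();
    for (int d = 1; d <= maxDistance; d++) {
      for (String entry : dictionary) {
        if (distance(word, entry) == d) {
          result.add(entry);
        }
      }
      if (result.size() >= enough) {
        break;
      }
    }
    return result;
  }
}
```

With the dictionary wel, wal, bel, welk and `enough` set to 2, a lookup for wèl returns only wel and wal, because the distance-2 tier is never visited.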

danielnaber commented 4 years ago

we keep searching for suggestions with distance=2.

Do you mean this code in MorfologikSpellerRule?

if (word.length() >= 3 && (onlyCaseDiffers || fullResults || defaultSuggestions.isEmpty())) {

Stopping early is good for performance, but the suggestions with a larger distance might still be good. They might even be moved to the top in a later re-sorting step (currently only for English). So changing the code would need quite some testing to prevent regressions, I think.

ghost commented 3 years ago

For shorter words, a manual list might be better.
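A minimal sketch of that idea, assuming a hand-maintained map that overrides the speller for the listed words. `ShortWordSuggestions` is a hypothetical class, not part of LanguageTool, and the speller is passed in as a plain function so the sketch stays self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ShortWordSuggestions {
  private final Map<String, List<String>> manual;          // curated entries, maintained by hand
  private final Function<String, List<String>> speller;    // stand-in for the regular speller

  public ShortWordSuggestions(Map<String, List<String>> manual,
                              Function<String, List<String>> speller) {
    this.manual = manual;
    this.speller = speller;
  }

  // A curated entry wins; every other word falls through to the speller.
  public List<String> suggest(String word) {
    List<String> curated = manual.get(word);
    return curated != null ? curated : speller.apply(word);
  }
}
```

Using the example from earlier in the thread, an entry mapping éen to een and één would return exactly those two suggestions, while longer words would still go through the speller.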