languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[en] "tarrifs" doesn't suggest "tariffs" #2659

Open tiff opened 4 years ago

tiff commented 4 years ago

[Screenshot, 2020-03-31 at 10:12:58]

danielnaber commented 4 years ago

This is caused by https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java#L440-L450, which calculates the candidates with larger edit distances only when there are no matches for edit distance == 1. That shortcut was originally introduced to improve performance. I will see how much difference it really still makes.
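To make the effect of that shortcut concrete, here is a minimal sketch of a tiered lookup, not the actual MorfologikSpellerRule code. The class name, the helper methods, and the three-word dictionary are all hypothetical and chosen only for illustration. Once a distance-1 candidate exists, the distance-2 word "tariffs" is never considered.

```java
import java.util.ArrayList;
import java.util.List;

public class TieredSuggestions {

    // Plain Levenshtein distance: insert, delete and substitute each cost 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // All dictionary entries whose distance to the word is between 1 and maxDist.
    static List<String> within(String word, List<String> dict, int maxDist) {
        List<String> result = new ArrayList<>();
        for (String entry : dict) {
            int d = distance(word, entry);
            if (d >= 1 && d <= maxDist) result.add(entry);
        }
        return result;
    }

    // Tiered lookup: widen the search only when distance 1 finds nothing.
    static List<String> suggest(String word, List<String> dict, int maxDist) {
        List<String> close = within(word, dict, 1);
        return close.isEmpty() ? within(word, dict, maxDist) : close;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumption): "tarries" is 1 edit away, "tariffs" is 2.
        List<String> dict = List.of("tariffs", "tarries", "tariff");
        System.out.println(suggest("tarrifs", dict, 2));  // tiered
        System.out.println(within("tarrifs", dict, 2));   // always compute all
    }
}
```

With this toy dictionary the tiered call returns only the distance-1 entry, while the untiered call also includes "tariffs". The trade-off is the one described above: fewer candidates to compute versus a better chance of containing the intended word.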

dpelle commented 4 years ago

When computing edit distance, is there a way to make double-letter differences cheaper in terms of edit cost than other letter differences? At least in French or English, it's frequent to make typos with double letters because we don't hear the difference (unlike in Italian, where double letters sound different). For example, the word "occurrence" is often misspelled as "occurence", "ocurrence", or "ocurence". Ideally, the suggestions should prefer words whose typos differ only in diacritics or double letters over other kinds of typos.
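One way to picture the idea is a weighted Levenshtein distance in which inserting or deleting a letter that repeats the letter before it costs less than any other edit. The sketch below is only an illustration of that idea, not something LanguageTool or Morfologik implements. The class name and the 0.5 weight are assumptions, and diacritics are not handled.

```java
public class DoubleLetterDistance {

    // Assumed weight for adding or dropping a doubled letter (not tuned).
    static final double DOUBLED_COST = 0.5;

    // Cost of inserting or deleting s[k]: cheaper when it repeats the previous letter.
    static double indelCost(String s, int k) {
        return (k > 0 && s.charAt(k) == s.charAt(k - 1)) ? DOUBLED_COST : 1.0;
    }

    static double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        // First column and row: delete all of a's prefix, or insert all of b's prefix.
        for (int i = 1; i <= a.length(); i++) d[i][0] = d[i - 1][0] + indelCost(a, i - 1);
        for (int j = 1; j <= b.length(); j++) d[0][j] = d[0][j - 1] + indelCost(b, j - 1);
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0.0 : 1.0);
                double del = d[i - 1][j] + indelCost(a, i - 1);
                double ins = d[i][j - 1] + indelCost(b, j - 1);
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("usefull", "useful"));       // 0.5
        System.out.println(distance("occurence", "occurrence")); // 0.5
        System.out.println(distance("ocurence", "occurrence"));  // 1.0
        System.out.println(distance("tarrifs", "tariffs"));      // 1.0
    }
}
```

Under these assumed weights, "tarrifs" to "tariffs" costs 1.0, the same as a single ordinary substitution, so it would at least tie with plain distance-1 candidates instead of ranking behind them.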

danielnaber commented 4 years ago

According to PerformanceTest2, always calculating all suggestions could make text checking 30% slower. The test is somewhat artificial, though.

> is there a way to make double-letter differences cheaper in terms of edit cost than other letter differences

The fsa.dict.speller.replacement-pairs setting could maybe be used for that. Indeed, adding the pairs "rr r" and "f ff" to that list improves the suggestion for this issue, but without a more complete evaluation, there's also the risk of other suggestions getting worse.
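As a rough illustration of why those two pairs help here, the following sketch applies "wrong fragment, replacement" pairs to the misspelled word and keeps the variants found in a dictionary. It is not Morfologik's implementation. The class, the method names, the depth limit, and the toy dictionary are hypothetical, and the comma-separated pair format is assumed from the example in the comment above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ReplacementPairs {

    // Parses "rr r, f ff" into {{"rr", "r"}, {"f", "ff"}} (format is an assumption).
    static String[][] parse(String spec) {
        String[] items = spec.split(",");
        String[][] pairs = new String[items.length][];
        for (int i = 0; i < items.length; i++) {
            pairs[i] = items[i].trim().split("\\s+");
        }
        return pairs;
    }

    // Collects every variant reachable with at most `depth` single-position replacements.
    static void expand(String word, String[][] pairs, int depth, Set<String> out) {
        if (depth == 0) return;
        for (String[] p : pairs) {
            int at = word.indexOf(p[0]);
            while (at >= 0) {
                String variant = word.substring(0, at) + p[1] + word.substring(at + p[0].length());
                out.add(variant);
                expand(variant, pairs, depth - 1, out);
                at = word.indexOf(p[0], at + 1);
            }
        }
    }

    // Keeps only the variants that are real words in the given dictionary.
    static List<String> suggest(String word, String spec, Set<String> dict) {
        Set<String> variants = new LinkedHashSet<>();
        expand(word, parse(spec), 2, variants);
        List<String> hits = new ArrayList<>();
        for (String v : variants) {
            if (dict.contains(v)) hits.add(v);
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("tariffs", "tarries"); // toy dictionary (assumption)
        System.out.println(suggest("tarrifs", "rr r, f ff", dict)); // [tariffs]
    }
}
```

The sketch also hints at the risk mentioned above: a pair as short as "f ff" matches in many words, so it can surface candidates for unrelated misspellings as well.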

dpelle commented 3 years ago

This ticket looks related to issue #922, where I commented (among other things):

> double letter errors should not count too much, i.e. "usefull" and "useful" should be close in distance.