Open tiff opened 4 years ago
This is caused by https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java#L440-L450, which calculates the candidates with larger edit distances only when there are no matches for edit distance == 1. This was originally introduced to improve performance. I will check how much difference it still makes.
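To illustrate the behavior described in this comment, here is a minimal sketch of a staged search: try edit distance 1 first and widen only when nothing is found. It is not the code from MorfologikSpellerRule; the class `StagedSuggestions`, the plain Levenshtein `distance`, and the list-based dictionary are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not LanguageTool code.
public class StagedSuggestions {

  // Plain Levenshtein distance: insert, delete and substitute all cost 1.
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      int[] cur = new int[b.length() + 1];
      cur[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
      }
      prev = cur;
    }
    return prev[b.length()];
  }

  // Collect candidates within distance 1; only if there are none,
  // retry with distance 2, and so on up to maxDistance.
  static List<String> suggest(String word, List<String> dictionary, int maxDistance) {
    for (int d = 1; d <= maxDistance; d++) {
      List<String> hits = new ArrayList<>();
      for (String candidate : dictionary) {
        if (distance(word, candidate) <= d) {
          hits.add(candidate);
        }
      }
      if (!hits.isEmpty()) {
        return hits; // stop early: larger distances are never examined
      }
    }
    return new ArrayList<>();
  }
}
```

With a dictionary containing "hello" and "heal", `suggest("helo", dictionary, 2)` returns only "hello": the distance-2 candidate "heal" is never considered because a distance-1 match exists. That early exit is what saves time, and also what can hide a better suggestion.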
When computing edit distance, is there a way to make double letter differences cheaper in terms of edit cost than other letter differences? At least in French or English, it's frequent to make typos with double letters because we don't hear the difference (unlike in Italian, where double letters sound different). For example, the word "occurrence" is often misspelled as "occurence", "ocurrence", or "ocurence". Ideally, the suggestions should prefer words whose typos involve only diacritics or double letter differences over other kinds of typos.
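As a rough sketch of what such a weighting could look like, the following is a Levenshtein variant in which adding or dropping one letter of a doubled pair costs less than any other edit. The class name, the cost of 0.5, and the exact rule for what counts as a double letter edit are all assumptions for illustration; this is not how Morfologik computes distances, and diacritics are not handled here.

```java
// Hypothetical illustration, not LanguageTool or Morfologik code.
public class DoubleLetterDistance {

  static final double DOUBLED_LETTER_COST = 0.5; // assumed weight
  static final double DEFAULT_COST = 1.0;

  static double distance(String a, String b) {
    double[][] d = new double[a.length() + 1][b.length() + 1];
    for (int i = 1; i <= a.length(); i++) {
      d[i][0] = d[i - 1][0] + DEFAULT_COST;
    }
    for (int j = 1; j <= b.length(); j++) {
      d[0][j] = d[0][j - 1] + DEFAULT_COST;
    }
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        char ca = a.charAt(i - 1);
        char cb = b.charAt(j - 1);
        double substitute = d[i - 1][j - 1] + (ca == cb ? 0.0 : DEFAULT_COST);
        // Deleting the second letter of a pair in `a` ("ll" -> "l") is cheap.
        boolean dropsDouble = ca == cb && i >= 2 && a.charAt(i - 2) == ca;
        double delete = d[i - 1][j] + (dropsDouble ? DOUBLED_LETTER_COST : DEFAULT_COST);
        // Inserting the second letter of a pair in `b` ("r" -> "rr") is cheap.
        boolean addsDouble = ca == cb && j >= 2 && b.charAt(j - 2) == cb;
        double insert = d[i][j - 1] + (addsDouble ? DOUBLED_LETTER_COST : DEFAULT_COST);
        d[i][j] = Math.min(substitute, Math.min(delete, insert));
      }
    }
    return d[a.length()][b.length()];
  }
}
```

Under these assumed weights, "ocurence" is at distance 1.0 from "occurrence" (two cheap insertions) and "usefull" at 0.5 from "useful", while an ordinary substitution such as "usefal" for "useful" still costs 1.0.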
According to PerformanceTest2, always calculating all suggestions could make text checking 30% slower. The test is somewhat artificial, though.
"is there a way to make double letter differences cheaper in terms of edit cost than other letter differences"
fsa.dict.speller.replacement-pairs could maybe be used for that. Indeed, adding rr r,f ff to that list improves the suggestion for this issue, but without a more complete evaluation, there's also the risk of other suggestions getting worse.
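For readers unfamiliar with the property, here is a small sketch of how a value like the one above splits into pairs. It assumes the value is a comma-separated list in which each entry holds two strings separated by a space; the class `ReplacementPairs` and its `parse` method are made up for this example and are not part of Morfologik or LanguageTool.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, only to show how the property value is structured.
public class ReplacementPairs {

  // Reads the property from .info-style text and splits it into pairs.
  // Assumed format: "first second,first second,..."
  static List<String[]> parse(String infoFileContent) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(infoFileContent));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String value = props.getProperty("fsa.dict.speller.replacement-pairs", "");
    List<String[]> pairs = new ArrayList<>();
    for (String entry : value.split(",")) {
      String[] parts = entry.trim().split("\\s+");
      if (parts.length == 2) {
        pairs.add(parts); // malformed or empty entries are skipped
      }
    }
    return pairs;
  }
}
```

Parsing the line `fsa.dict.speller.replacement-pairs=rr r,f ff` this way yields two pairs, ("rr", "r") and ("f", "ff").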
This ticket looks related to issue #922 where I posted the comment (among other things):
double letter errors should not count too much i.e. "usefull" and "useful" should be close in distance.