Closed kostyfisik closed 7 years ago
One more pair: withe/with. Example: "A problem withe the manuscript."
I've added shot/short. It detects "It is rather shot." as an error. In general, it only finds about 27% of errors. For withe/with, I can't even find enough example sentences with "withe" to evaluate the pair.
"withe" was a typo that passed the LT spell checker; it should be caught somehow.
Maybe we shouldn't accept it at all, considering how rare it is? Maybe @MikeUnwalla has an opinion on this.
Does LT have a special warning for the use of extremely rare words? Perhaps one based on 1-gram statistics?
No, although that could be developed of course. But would then every word need to be looked up? The lookup is fast, but that might still be too much.
'Withe' is a rare word. I had to look in a dictionary to find its meaning. We should show the error, and live with the false positives when someone writes and means, "The withe is ..."
I think it could be a one-time extraction into a small dictionary of rare words that have a popular counterpart at Levenshtein distance one (and, if the dictionary does not get too big, two). A warning about possible confusion should be raised for words from such a dictionary (which should be rather fast).
It is quite a universal principle that should fit any European language (any LT-supported language?) with 1-gram data.
Or it should be Damerau–Levenshtein distance = 1 as it is used in ispell
A smart option: if replacing the rare word causes some other rule to report an error, do not show a warning about the rare word.
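The rare-word dictionary idea above could be sketched roughly as follows. This is only an illustration in Python, not LanguageTool code: the counts are toy values (only the 20,000 for "withe" echoes a figure from this thread), and `edits1`, `build_rare_word_dict`, `check`, `rare_max` and `popular_min` are hypothetical names and thresholds chosen for the example.

```python
import string

# Toy 1-gram counts, invented for illustration. Only the 20,000 for
# "withe" echoes a number mentioned in this thread.
UNIGRAM_COUNTS = {
    "with": 5_000_000,
    "withe": 20_000,
    "short": 900_000,
    "shot": 400_000,
    "manuscript": 50_000,
}


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at Damerau-Levenshtein distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts) - {word}


def build_rare_word_dict(counts, rare_max=100_000, popular_min=1_000_000):
    """One-time extraction: rare word -> popular neighbours at distance 1.

    Both thresholds are assumptions; real values would need tuning.
    """
    rare = {}
    for word, n in counts.items():
        if n > rare_max:
            continue  # not rare enough to be suspicious
        neighbours = sorted(
            w for w in edits1(word) if counts.get(w, 0) >= popular_min
        )
        if neighbours:
            rare[word] = neighbours
    return rare


def check(sentence, rare_words):
    """Return (token, suggestions) for tokens found in the rare-word dictionary."""
    tokens = [t.strip(string.punctuation).lower() for t in sentence.split()]
    return [(t, rare_words[t]) for t in tokens if t in rare_words]


RARE_WORDS = build_rare_word_dict(UNIGRAM_COUNTS)
print(check("A problem withe the manuscript.", RARE_WORDS))
```

Because the dictionary is built once, checking a text is one hash lookup per token. The sketch does not cover the "smart option" of suppressing the warning when the replacement triggers another rule.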
Or it should be Damerau–Levenshtein distance = 1 as it is used in ispell
That was what I was looking for in another thread. Many thanks!
I've committed org.languagetool.dev.RareWordsFinder. Turns out it's not immediately useful, as even a "rare" word like "withe" has 20,000 occurrences in the Google ngram corpus.
"A problem withe the manuscript." is detected now, although by a different rule... but still, I'll close this issue, as I don't see a general solution.
It is probably worth adding an "Unsolved" tag for this and other cases where no reasonable solution is available right now.
WRONG: It is rather shot CORRECT: It is rather short
Maybe it can be done with ngrams in general. Probably "rather shot" is always an error.
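A rough sketch of what such an ngram check might look like, in Python. It is not how LanguageTool implements its ngram rules: the bigram counts are invented toy values, and `CONFUSION_PAIRS`, `suggest` and the `min_ratio` threshold are hypothetical names made up for this example.

```python
# Toy bigram counts, invented for illustration; not real corpus data.
BIGRAM_COUNTS = {
    ("rather", "short"): 12_000,
    ("rather", "shot"): 3,
    ("problem", "with"): 80_000,
}

# Confusable word -> the more common word it is often mistyped for.
CONFUSION_PAIRS = {"shot": "short", "withe": "with"}


def suggest(tokens, min_ratio=100):
    """Flag a confusable word when the bigram with its alternative is far
    more frequent than the bigram as written.

    `min_ratio` is an assumed threshold, not a tuned value.
    """
    findings = []
    for i in range(1, len(tokens)):
        prev, word = tokens[i - 1], tokens[i]
        alt = CONFUSION_PAIRS.get(word)
        if alt is None:
            continue
        seen = BIGRAM_COUNTS.get((prev, word), 0)
        alt_seen = BIGRAM_COUNTS.get((prev, alt), 0)
        # max(seen, 1) avoids flagging pairs with no evidence either way.
        if alt_seen >= min_ratio * max(seen, 1):
            findings.append((i, word, alt))
    return findings


print(suggest("it is rather shot".split()))
print(suggest("he took a shot".split()))
```

With these toy counts, "rather shot" is flagged while "a shot" is left alone, because there is no bigram evidence favouring "a short".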