languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

shot/short, with/withe pair and other rare words #609

Closed kostyfisik closed 7 years ago

kostyfisik commented 8 years ago

WRONG: It is rather shot
CORRECT: It is rather short

Maybe this can be done with n-grams in general. "rather shot" is probably always an error.

kostyfisik commented 8 years ago

One more pair: withe/with. Example: "A problem withe the manuscript."

danielnaber commented 8 years ago

I've added shot/short. It detects "It is rather shot." as an error. In general, though, it only finds ~27% of errors. For withe/with, I can't even find enough example sentences containing "withe" to evaluate the pair.

kostyfisik commented 8 years ago

"withe" was a typo that got past the LT spell checker; it should be caught somehow. [screenshot from 2016-11-16 21-05-08]

danielnaber commented 8 years ago

Maybe we shouldn't accept it at all, considering how rare it is? Maybe @MikeUnwalla has an opinion on this.

kostyfisik commented 8 years ago

Does LT have a special warning for the use of extremely rare words? Perhaps based on 1-gram statistics?

danielnaber commented 8 years ago

No, although that could be developed, of course. But wouldn't every word then need to be looked up? The lookup is fast, but that might still be too much.

MikeUnwalla commented 8 years ago

'Withe' is a rare word. I had to look in a dictionary to find its meaning. We should show the error, and live with the false positives when someone writes and means, "The withe is ..."

kostyfisik commented 8 years ago

I think it can be a one-time extraction into a small dictionary of rare words that have a popular counterpart at Levenshtein distance one (and, if the dictionary does not get too big, two). A warning about the possible confusion would then be raised for any word found in that dictionary (which should be rather fast).
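To make the idea concrete, here is a minimal Python sketch of such a one-time extraction. The word counts, the two thresholds (`rare_max`, `common_min`) and the function names are all made up for illustration; this is not how LT does anything today, and real 1-gram data would need tuned thresholds.

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at Levenshtein distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # Replacing a letter with itself reproduces `word`, so drop it.
    return (deletes | replaces | inserts) - {word}


def build_rare_word_dict(counts, rare_max, common_min):
    """Map each rare word to its common neighbours at distance 1."""
    common = {w for w, n in counts.items() if n >= common_min}
    table = {}
    for word, n in counts.items():
        if n <= rare_max:
            # Most frequent counterpart first.
            neighbours = sorted(edits1(word) & common, key=lambda w: -counts[w])
            if neighbours:
                table[word] = neighbours
    return table


# Toy 1-gram counts, invented for this example.
counts = {
    "with": 5_000_000,
    "withe": 20_000,
    "wither": 90_000,
    "short": 800_000,
    "shot": 600_000,
    "the": 9_000_000,
}
rare = build_rare_word_dict(counts, rare_max=50_000, common_min=500_000)
print(rare)  # {'withe': ['with']}
```

At check time the lookup is a single dictionary access per token, so the expensive neighbour search only happens once, when the dictionary is built. Note that "shot" is not caught here because it is itself a common word; pairs like shot/short would still need a context-based rule.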

kostyfisik commented 8 years ago

It is quite a universal principle that should fit any European language (any LT-supported language?) that has 1-gram data.

kostyfisik commented 8 years ago

Or it should be Damerau–Levenshtein distance = 1 as it is used in ispell
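For reference, a minimal Python sketch of the difference: Damerau–Levenshtein also counts a swap of two adjacent letters as a single edit. The code below implements the restricted "optimal string alignment" variant, which is enough for a distance-1 check; the function name is my own.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] is the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "or" <-> "ro".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("withe", "with"))  # 1 (one deletion)
print(osa_distance("shot", "short"))  # 1 (one insertion)
print(osa_distance("form", "from"))   # 1 (one transposition; plain Levenshtein gives 2)
```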

kostyfisik commented 8 years ago

A smart option: if replacing the rare word with its common counterpart causes some other rule to report an error, do not show the rare-word warning.
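Roughly, in Python. This is only a sketch: `has_other_errors` stands in for "run all the other rules on the sentence" and is a hypothetical callback, not an LT API, and the toy checker below knows a single invented bigram rule.

```python
def should_warn(tokens, index, replacement, has_other_errors):
    """Decide whether the rare word at tokens[index] deserves a warning.

    The warning is suppressed when swapping in the common counterpart
    makes the sentence trigger some other rule.
    """
    candidate = list(tokens)
    candidate[index] = replacement
    return not has_other_errors(candidate)


# Toy stand-in for the other rules: flags a determiner followed by "with".
BAD_BIGRAMS = {("the", "with")}


def toy_has_other_errors(tokens):
    lowered = [t.lower() for t in tokens]
    return any(pair in BAD_BIGRAMS for pair in zip(lowered, lowered[1:]))


typo = ["A", "problem", "withe", "the", "manuscript"]
intended = ["The", "withe", "is", "broken"]
print(should_warn(typo, 2, "with", toy_has_other_errors))      # True
print(should_warn(intended, 1, "with", toy_has_other_errors))  # False
```

The second sentence is left alone because "The with is broken" would itself be flagged, which suggests the writer really meant "withe".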

TiagoSantos81 commented 8 years ago

"Or it should be Damerau–Levenshtein distance = 1 as it is used in ispell"

That was what I was looking for in another thread. Many thanks!

danielnaber commented 8 years ago

I've committed org.languagetool.dev.RareWordsFinder. Turns out it's not immediately useful, as even a "rare" word like "withe" has 20,000 occurrences in the Google ngram corpus.

danielnaber commented 7 years ago

"A problem withe the manuscript." is detected now, although by a different rule. Still, I'll close this issue, as I don't see a general solution.

kostyfisik commented 7 years ago

It is probably worth adding an "Unsolved" tag for this and other cases that have no reasonable solution available right now.