languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

multi word spell checker #1593

Open ghost opened 5 years ago

ghost commented 5 years ago

I suggest adding a kind of multi-word spell checker. It takes any group of 1 to n words (or tokens) and checks whether that group is in a 'common word group' list (n-gram data?). The n-gram data is currently limited to 4 tokens; for this function to work properly, it would have to contain longer n-grams, as long as they are common enough. If the group is in the list, nothing is reported. If it is not, the checker looks for similar word groups that are in the list. Spaces and punctuation are treated as characters that can be dropped or added, just like letters. So it works much like spell checking, but across multiple words.

There are two possible rules here. The first reports an unexpected word group (there might be something wrong here) and would fire relatively often, depending on the size of the data set and the thresholds. The second simply states: you might mean '', since this is much more common.
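A minimal sketch of the idea, not LanguageTool code: the word-group list is a toy set standing in for real n-gram data, the names `COMMON_GROUPS`, `space_variants` and `check_group` are hypothetical, and only a single dropped or added space is considered (punctuation and frequency thresholds are left out).

```python
# Toy stand-in for the "common word group" list. Real data would come from
# n-gram counts filtered by a frequency threshold (an assumption here).
COMMON_GROUPS = {"het kader", "jaar geleden", "zijn eigen"}


def space_variants(text):
    """Return all strings reachable by dropping or adding exactly one space."""
    variants = set()
    for i, ch in enumerate(text):
        if ch == " ":
            # Drop the space at position i, joining two tokens.
            variants.add(text[:i] + text[i + 1:])
        elif i > 0 and text[i - 1] != " ":
            # Add a space before position i, splitting one token.
            variants.add(text[:i] + " " + text[i:])
    return variants


def check_group(text, common=COMMON_GROUPS):
    """Classify a word group as known, correctable, or merely unexpected."""
    if text in common:
        return "ok", []
    suggestions = sorted(v for v in space_variants(text) if v in common)
    if suggestions:
        # Second rule: a much more common neighbour exists.
        return "did_you_mean", suggestions
    # First rule: the group is not known and nothing similar was found.
    return "unexpected", []


print(check_group("het ka der"))
print(check_group("foo bar"))
```

With real data, membership in `common` would be replaced by a count lookup, and the two return values would map onto the two rules described above.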

ghost commented 5 years ago

Examples: het ka der => het kader; jaar ge leden => jaar geleden; zijn ei gen => zijn eigen

Some of these errors come from faulty Unicode conversion ( w o r d ) or other automatically introduced errors; others come from hitting the space bar at the wrong time.
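The letter-spaced case ( w o r d ) could be sketched as a pre-processing step. This is an illustration under assumptions, not an existing rule: `collapse_spaced_letters` is a hypothetical helper, and the cut-off of three or more single-character tokens in a row is an arbitrary choice that would need tuning, since legitimate sequences of single letters do occur.

```python
import re

# Three or more single word characters, each separated by one space.
SPACED_LETTERS = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced_letters(text):
    """Join runs of single-character tokens into one candidate word."""
    return SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_letters("( w o r d )"))
```

The collapsed candidate would still have to be checked against the word list before being offered as a suggestion.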