languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[ru]n-grams #569

Open kostyfisik opened 7 years ago

kostyfisik commented 7 years ago

It seems that n-grams are very useful. In Russian, many rare words are excluded from spell-check dictionaries so that more possible errors are flagged (see the rare.koi file from http://scon155.phys.msu.su/~swan/orthography.html , e.g. шоссе-шассе).

How can we add Russian n-grams support to LT? Russian n-grams are present at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

yakovru commented 7 years ago

Prepared n-gram data for Russian can be downloaded from https://languagetool.org/download/ngram-data/untested/ngram-ru-20150914.zip
The n-gram rules are in https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ru/src/main/resources/org/languagetool/resource/ru/confusion_sets.txt
Other useful info: http://wiki.languagetool.org/finding-errors-using-n-gram-data

kostyfisik commented 7 years ago

The wiki says that it is only available for English and German at the moment. What should be changed to add official Russian n-gram support?

yakovru commented 7 years ago

The Russian n-gram data is ready to use, but confusion_sets.txt is empty. At least one n-gram rule must be specified in https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ru/src/main/resources/org/languagetool/resource/ru/confusion_sets.txt

danielnaber commented 7 years ago

I also suggest adding some Russian ngrams before we add it as an officially supported language on the Wiki.

kostyfisik commented 7 years ago

Can anyone check whether these pairs fit the n-gram pattern? (Mostly inspired by the ispell rare.koi dictionary, ftp://scon155.phys.msu.su/pub/russian/ispell/rus-ispell.tgz )
дисконт-дискант
шассе-шоссе
шассе-шасси
шасси-шоссе
солить-салить
стелярный-столярный
бораны-бараны
доильный-двоильный
давильный-доильный
адоптивный-адаптивный
серпантин-серпентин

yakovru commented 7 years ago

I'll try to do it.

yakovru commented 7 years ago

I get the following for шоссе-шасси:

Factor: 10 - 11 false positives, 36 false negatives
шоссе; шасси; 10; # p=0.845, r=0.625, 48+48, 3grams, 2016-10-12

Factor: 100 - 4 false positives, 43 false negatives
шоссе; шасси; 100; # p=0.930, r=0.552, 48+48, 3grams, 2016-10-12

Factor: 1000 - 1 false positives, 63 false negatives
шоссе; шасси; 1000; # p=0.971, r=0.344, 48+48, 3grams, 2016-10-12

Factor: 10000 - 1 false positives, 71 false negatives
шоссе; шасси; 10000; # p=0.962, r=0.260, 48+48, 3grams, 2016-10-12

Factor: 100000 - 1 false positives, 78 false negatives
шоссе; шасси; 100000; # p=0.947, r=0.188, 48+48, 3grams, 2016-10-12

Factor: 1000000 - 1 false positives, 79 false negatives
шоссе; шасси; 1000000; # p=0.944, r=0.177, 48+48, 3grams, 2016-10-12

Factor: 10000000 - 0 false positives, 84 false negatives
шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12
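
For readers checking the numbers: the p (precision) and r (recall) values can be recomputed from the false positive and false negative counts. The Python sketch below assumes that "48+48" means 96 test sentences in which an error should be detected. That reading is an assumption, not something stated in the thread, but it reproduces every p and r value above. The names `RESULTS`, `TOTAL_ERRORS`, and `precision_recall` are made up for this illustration.

```python
# Counts copied from the evaluation output: factor -> (false positives, false negatives).
RESULTS = {
    10: (11, 36),
    100: (4, 43),
    1000: (1, 63),
    10000: (1, 71),
    100000: (1, 78),
    1000000: (1, 79),
    10000000: (0, 84),
}

# Assumption: "48+48" means 96 sentences that each contain an error to detect.
TOTAL_ERRORS = 48 + 48


def precision_recall(false_pos, false_neg, total_errors=TOTAL_ERRORS):
    """Return (precision, recall) given the counts for one factor."""
    true_pos = total_errors - false_neg           # errors that were detected
    precision = true_pos / (true_pos + false_pos)  # share of alarms that were real
    recall = true_pos / total_errors               # share of errors that were caught
    return precision, recall


if __name__ == "__main__":
    for factor, (fp, fn) in RESULTS.items():
        p, r = precision_recall(fp, fn)
        print(f"factor={factor:<9} p={p:.3f} r={r:.3f}")
```

Under this reading, a higher factor trades recall for precision: fewer false alarms, but more missed errors.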

kostyfisik commented 7 years ago

Does this mean that n-grams do not work for rare words? Or is "Factor: 1000 - 1 false positives, 63 false negatives" a good result?

yakovru commented 7 years ago

I think it is a good result. I added this line to confusion_sets.txt:
шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12
But this rule does not detect all mistakes with "шоссе; шасси".
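
To illustrate how one factor might be chosen among the candidates, here is a rough Python sketch. It parses lines in the "word1; word2; factor; # p=..., r=..." layout shown in this thread and keeps the entry with the highest recall among those that reach a precision floor. The regular expression, the function names `parse_line` and `pick`, and the selection criterion are assumptions made for this example. They do not describe how LanguageTool itself reads confusion_sets.txt or selects a factor.

```python
import re

# Candidate lines copied from the evaluation output in this thread.
CANDIDATES = """\
шоссе; шасси; 10; # p=0.845, r=0.625, 48+48, 3grams, 2016-10-12
шоссе; шасси; 100; # p=0.930, r=0.552, 48+48, 3grams, 2016-10-12
шоссе; шасси; 1000; # p=0.971, r=0.344, 48+48, 3grams, 2016-10-12
шоссе; шасси; 10000; # p=0.962, r=0.260, 48+48, 3grams, 2016-10-12
шоссе; шасси; 100000; # p=0.947, r=0.188, 48+48, 3grams, 2016-10-12
шоссе; шасси; 1000000; # p=0.944, r=0.177, 48+48, 3grams, 2016-10-12
шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12
"""

# Hypothetical pattern for the line layout seen above (not LanguageTool's own parser).
LINE_RE = re.compile(
    r"^(?P<w1>[^;#]+);\s*(?P<w2>[^;#]+);\s*(?P<factor>\d+);"
    r"\s*#\s*p=(?P<p>[\d.]+),\s*r=(?P<r>[\d.]+)"
)


def parse_line(line):
    """Parse one candidate line; return None for lines that do not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "words": (m["w1"].strip(), m["w2"].strip()),
        "factor": int(m["factor"]),
        "precision": float(m["p"]),
        "recall": float(m["r"]),
    }


def pick(lines, min_precision):
    """Among entries with precision >= min_precision, return the one with the best recall."""
    entries = [e for e in map(parse_line, lines) if e and e["precision"] >= min_precision]
    return max(entries, key=lambda e: e["recall"]) if entries else None


if __name__ == "__main__":
    best = pick(CANDIDATES.splitlines(), min_precision=1.0)
    print(best["factor"], best["precision"], best["recall"])
```

With a floor of 1.0, only the factor 10000000 entry qualifies, which matches the line added above. A lower floor such as 0.97 would select factor 1000 instead, catching more mistakes at the cost of one false alarm in this test set.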

yakovru commented 7 years ago

I want to try the "не, ни" words next.

kostyfisik commented 7 years ago

I was sure that не/ни has well-defined usage rules: http://www.evartist.narod.ru/text1/38.htm

yakovru commented 7 years ago

I think the rule for "не, ни" should be placed in grammar.xml, but I want to compare the n-gram rule with a standard rule.