Open kostyfisik opened 7 years ago
Prepared n-gram data for Russian can be downloaded from https://languagetool.org/download/ngram-data/untested/ngram-ru-20150914.zip The rules for n-grams are in https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ru/src/main/resources/org/languagetool/resource/ru/confusion_sets.txt Other useful info: http://wiki.languagetool.org/finding-errors-using-n-gram-data
The wiki says that it is only available for English and German at the moment. What should be changed to add official Russian n-gram support?
The Russian n-gram data is ready to use, but confusion_sets.txt is empty. At least one n-gram rule must be specified in https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ru/src/main/resources/org/languagetool/resource/ru/confusion_sets.txt
I also suggest adding some Russian ngrams before we add it as an officially supported language on the Wiki.
Can anyone check whether these pairs fit the n-gram pattern? (Mostly inspired by the ispell rare.koi dictionary, ftp://scon155.phys.msu.su/pub/russian/ispell/rus-ispell.tgz )
дисконт-дискант
шассе-шоссе
шассе-шасси
шасси-шоссе
солить-салить
стелярный-столярный
бораны-бараны
доильный-двоильный
давильный-доильный
адоптивный-адаптивный
серпантин-серпентин
I'll try to do it.
For шоссе-шасси I get:
Factor: 10 - 11 false positives, 36 false negatives
шоссе; шасси; 10; # p=0.845, r=0.625, 48+48, 3grams, 2016-10-12
Factor: 100 - 4 false positives, 43 false negatives
шоссе; шасси; 100; # p=0.930, r=0.552, 48+48, 3grams, 2016-10-12
Factor: 1000 - 1 false positives, 63 false negatives
шоссе; шасси; 1000; # p=0.971, r=0.344, 48+48, 3grams, 2016-10-12
Factor: 10000 - 1 false positives, 71 false negatives
шоссе; шасси; 10000; # p=0.962, r=0.260, 48+48, 3grams, 2016-10-12
Factor: 100000 - 1 false positives, 78 false negatives
шоссе; шасси; 100000; # p=0.947, r=0.188, 48+48, 3grams, 2016-10-12
Factor: 1000000 - 1 false positives, 79 false negatives
шоссе; шасси; 1000000; # p=0.944, r=0.177, 48+48, 3grams, 2016-10-12
Factor: 10000000 - 0 false positives, 84 false negatives
шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12
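For readers who want to check how the p and r values relate to the false positive and false negative counts, here is a minimal Python sketch. It assumes that "48+48" means 96 test sentences in total and that each one contains exactly one error the rule should find; that reading is an assumption, not something stated in the thread, but it reproduces every reported value. The function name `precision_recall` is made up for this illustration.

```python
from __future__ import annotations

# Assumption: "48+48" = 48 test sentences per word, each with one error to detect.
TOTAL_SENTENCES = 48 + 48


def precision_recall(false_positives: int, false_negatives: int,
                     total: int = TOTAL_SENTENCES) -> tuple[float, float]:
    """Return (precision, recall) from the counts printed by the evaluation."""
    true_positives = total - false_negatives
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / total
    return precision, recall


# (factor, false positives, false negatives) as reported above
RESULTS = [
    (10, 11, 36),
    (100, 4, 43),
    (1000, 1, 63),
    (10000, 1, 71),
    (100000, 1, 78),
    (1000000, 1, 79),
    (10000000, 0, 84),
]

for factor, fp, fn in RESULTS:
    p, r = precision_recall(fp, fn)
    print(f"factor={factor:>8}: p={p:.3f}, r={r:.3f}")
```

Under this reading, a higher factor trades recall for precision: fewer correct sentences are flagged, but more real errors are missed.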
Does this mean that n-grams do not work for rare words? Or is Factor: 1000 (1 false positive, 63 false negatives) a good result?
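As a rough illustration of why recall drops as the factor grows, here is a simplified Python sketch. It assumes the rule flags a word only when the alternative is at least `factor` times more probable in its context than the word as written; this is a simplification for illustration, not LanguageTool's actual code. The function `should_flag` and the probabilities are hypothetical.

```python
def should_flag(p_written: float, p_alternative: float, factor: float) -> bool:
    """Flag only if the alternative is at least `factor` times more probable.

    Simplified assumption about how the factor is applied; not LanguageTool's code.
    """
    return p_alternative >= factor * p_written


# Hypothetical context probabilities for a sentence where the writer used the
# wrong word: the alternative is about 2000 times more likely.
p_written = 1e-9
p_alternative = 2e-6

for factor in (10, 100, 1000, 10000):
    print(factor, should_flag(p_written, p_alternative, factor))
```

With rare words the n-gram counts are small, so the ratio between the two candidates seldom becomes very large, and a high factor suppresses most matches.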
I think it is a good result. I added this line to confusion_sets.txt: шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12 However, this rule does not detect all mistakes involving "шоссе; шасси".
Next I want to try the words "не, ни".
I was sure that не/ни has well-defined usage: http://www.evartist.narod.ru/text1/38.htm
I think the rule for "не, ни" should be placed in grammar.xml, but I want to compare the n-gram rule with the standard rule.
It seems that n-grams are very useful: in Russian there are many rare words that are excluded from spellcheck dictionaries to provide more matches for possible errors (like the rare.koi file from http://scon155.phys.msu.su/~swan/orthography.html , e.g. шоссе-шассе).
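The entry quoted above can be read as two words, a factor, and a trailing comment. The following Python sketch parses a line of that shape; the format is inferred only from the single line in this thread, so the real confusion_sets.txt syntax may allow more than this. `ConfusionPair` and `parse_confusion_line` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ConfusionPair:
    word1: str
    word2: str
    factor: int
    comment: str


def parse_confusion_line(line: str) -> ConfusionPair | None:
    """Parse 'word1; word2; factor; # comment'.

    Returns None for blank or comment-only lines. The format is an assumption
    based on the one example in the thread.
    """
    content, _, comment = line.partition("#")
    parts = [part.strip() for part in content.split(";") if part.strip()]
    if not parts:
        return None
    if len(parts) != 3:
        raise ValueError(f"expected 'word1; word2; factor', got: {line!r}")
    word1, word2, factor = parts
    return ConfusionPair(word1, word2, int(factor), comment.strip())


entry = parse_confusion_line(
    "шоссе; шасси; 10000000; # p=1.000, r=0.125, 48+48, 3grams, 2016-10-12"
)
print(entry)
```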
How can we add Russian n-gram support to LT? Russian n-grams are available at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html