languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[ru] Accented vowels should be treated as normal for correct tagging. #526

Closed kostyfisik closed 2 years ago

kostyfisik commented 8 years ago

Since there is no regular way to type accented vowels, they should be automatically replaced with plain vowels before the morphological speller and POS tagging run.

No tag found: Бóльшую - -

Same word: Большую больший большой больший ADJ:Fem:V ADJ:Fem:V ADJ_Com:Fem:V SENT_END

yakovru commented 8 years ago

It may be useful for other languages too, not only Russian.

kostyfisik commented 8 years ago

@danielnaber Is it possible to define an equivalence table for stressed and unstressed letters? I am not sure that this is the full story; however, it seems to be relevant to English too. Search for 'poetry' on the wiki page https://en.wikipedia.org/wiki/English_terms_with_diacritical_marks for examples: there the diacritics only change the pronunciation and nothing from a grammatical point of view.

danielnaber commented 8 years ago

I don't think we have such a feature in the tagger. BaseTagger would probably need to be modified.

kostyfisik commented 8 years ago

@danielnaber Keep the following point in mind: while

б`ольшую and больш`ую

(the accent should be above the following letter) are the same word (in most cases nobody cares, and there is no simple way to type the accent on a Russian keyboard), they actually have different POS tags. So if a POS tag is defined for a word with a stressed vowel, it should be used. If there are no tags, try removing all accents, and you will probably find the tags for both stress variants. The root of this tradition is that most computer dictionaries of Russian do not mark stress. However, in some rare cases the stress can be critical, so people spend some time finding out how to enter it in a text editor. It would be a pity if, in this critical case, LT missed some stupid typo just because it was not able to determine the POS tag...

danielnaber commented 8 years ago

Feel free to implement the suggested behavior in RussianTagger, probably by overriding getAnalyzedTokens(). If it doesn't find a result, it could call itself with a normalized version of its input.
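The fallback described above can be sketched outside of LanguageTool. This is only an illustration under stated assumptions: `AccentFallbackLookup`, its `Map`-based dictionary and the `stripStress` helper are hypothetical names, not LanguageTool classes, and the stress mark is assumed to be the combining acute accent U+0301.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a tagger: it maps a word form to its POS tags.
final class AccentFallbackLookup {
  private final Map<String, List<String>> dictionary = new HashMap<>();

  void add(String word, List<String> tags) {
    dictionary.put(word, tags);
  }

  // Assumption: stress is written with the combining acute accent (U+0301).
  static String stripStress(String word) {
    return word.replace("\u0301", "");
  }

  // Look the word up as written; only if nothing is found, retry once
  // with the stress marks removed.
  List<String> tags(String word) {
    List<String> exact = dictionary.get(word);
    if (exact != null) {
      return exact;
    }
    String normalized = stripStress(word);
    if (!normalized.equals(word)) {
      return tags(normalized);
    }
    return Collections.emptyList();
  }
}
```

With an entry for the unaccented form only, a lookup of the accented form returns the tags of the unaccented one, while an accented entry, if present, would take precedence.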

kostyfisik commented 8 years ago

I see. The advanced tagger was implemented for German here https://github.com/languagetool-org/languagetool/blob/18ec48494986462071e9de58c7636a35ad2cd4d1/languagetool-language-modules/de/src/main/java/org/languagetool/tagging/de/GermanTagger.java and for Ukrainian here https://github.com/languagetool-org/languagetool/blob/6443a35ad9967c1d3b09b0184c8afea9ac61047b/languagetool-language-modules/uk/src/main/java/org/languagetool/tagging/uk/UkrainianTagger.java

The Russian tagger is really a basic one here https://github.com/languagetool-org/languagetool/blob/6854fe580f773ee2f70e34ee9f0eac97fcc837a8/languagetool-language-modules/ru/src/main/java/org/languagetool/tagging/ru/RussianTagger.java same for English https://github.com/languagetool-org/languagetool/blob/6854fe580f773ee2f70e34ee9f0eac97fcc837a8/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java

However, the advanced tagging code does not make a lot of sense to me. The documentation here http://wiki.languagetool.org/developing-a-tagger-dictionary covers some other aspect of tagging, and I was not able to find any documentation in the developers section of the wiki that gives a general description of how LT performs checks, particularly what the workflow for the POS tags is. My wild guess is that the POS tags are just a list of strings that is linked to each token during processing. This should make the postag_regex implementation a simple iteration over a list. However, it is not clear what the typical order of analysis is. Does it take a sentence or a paragraph as the basic analysed unit (or maybe several paragraphs as a speed optimization, which is what I would do in C++ development to achieve the best data cache-hit ratio)? So what is the memory model? Which data fields are the most important, and how are they organized (classes, database, files, etc.)? What are the sources for POS tags (the regular dictionary, added.txt and removed.txt, code in XXXTagger, something else?), and in which order do they apply?

danielnaber commented 8 years ago

The general approach is documented at http://wiki.languagetool.org/development-overview#toc3. POS tags are indeed just strings; they are in AnalyzedToken, which in turn is part of AnalyzedTokenReadings (see javadoc). However, you shouldn't need to know this as long as you just modify the BaseTagger subclass.

arysin commented 8 years ago

We have good support for accented characters in Ukrainian. You can take a look. We can move some of it up if needed.

Andriy

kostyfisik commented 8 years ago

This is a copy-paste from https://github.com/languagetool-org/languagetool/blob/7c70acb951afba2da89c2e577180c76b1383952b/languagetool-language-modules/ca/src/main/java/org/languagetool/tagging/ca/CatalanTagger.java simplified and changed to fit Russian. Is it safe to put it into RussianTagger? I still do not understand how it works, but I have a feeling that it can work. Since it is bad practice to code by feeling, without a complete understanding of what is going on, I would like to ask someone more skilled to review the code before it goes into the source tree.

  @Override
  public List<AnalyzedTokenReadings> tag(final List<String> sentenceTokens)
      throws IOException {

    final List<AnalyzedTokenReadings> tokenReadings = new ArrayList<>();
    int pos = 0;

    for (String word : sentenceTokens) {
      final List<AnalyzedToken> l = new ArrayList<>();
      // First try the token exactly as written.
      List<AnalyzedToken> taggerTokens = asAnalyzedTokenListForTaggedWords(word, getWordTagger().tag(word));
      addTokens(taggerTokens, l);
      if (l.isEmpty()) {
        // Fallback: look up a lowercased copy with the accented Latin vowels
        // replaced by the corresponding Cyrillic vowels.
        String normalized = word.toLowerCase(conversionLocale);
        // This hack allows all rules and dictionary entries to work with stress over vowel
        if (normalized.length() > 1) {
          normalized = normalized.replace("ó", "о");
          normalized = normalized.replace("á", "а");
          normalized = normalized.replace("é", "е");
          normalized = normalized.replace("ý", "у");
        }
        // Keep the original token text; only the dictionary lookup uses the normalized form.
        List<AnalyzedToken> normalizedTokens = asAnalyzedTokenListForTaggedWords(word, getWordTagger().tag(normalized));
        addTokens(normalizedTokens, l);
        if (l.isEmpty()) {
          l.add(new AnalyzedToken(word, null, null));
        }
      }
      AnalyzedTokenReadings atr = new AnalyzedTokenReadings(l, pos);
      tokenReadings.add(atr);
      pos += word.length();
    }
    return tokenReadings;
  }

kostyfisik commented 8 years ago

@arysin I found the Ukrainian code, but I was not able to locate the accent support. See the code above for an approach that looks cleaner (at least to me).

arysin commented 8 years ago

See https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/language/Ukrainian.java

  @Override
  public Pattern getIgnoredCharactersRegex() {
    return Pattern.compile("[\u00AD\u0301]");
  }

can't get much cleaner than that :)
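To see what this character class does to a word, here is a minimal standalone sketch. It only applies the same pattern with plain `java.util.regex`; `IgnoredCharsDemo` and `strip` are hypothetical names, and how LanguageTool itself applies the pattern internally is not shown here.

```java
import java.util.regex.Pattern;

final class IgnoredCharsDemo {
  // Same character class as in the snippet above:
  // soft hyphen (U+00AD) and combining acute accent (U+0301).
  static final Pattern IGNORED = Pattern.compile("[\u00AD\u0301]");

  // Returns the text with every ignored character removed.
  static String strip(String text) {
    return IGNORED.matcher(text).replaceAll("");
  }

  public static void main(String[] args) {
    String accented = "пів-Євро\u0301пи";
    // The stripped form equals the unaccented spelling.
    System.out.println(strip(accented).equals("пів-Європи"));
  }
}
```

Note that the pattern only removes the combining mark; a precomposed letter such as Latin ó (U+00F3) is a single code point and is left untouched.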

kostyfisik commented 8 years ago

Ignoring is actually what happens at the moment. The idea is to provide POS tags for the accented word: if it has a defined POS tag, keep it as it is; if its POS tag list is empty, try to fill it with the tags of the non-accented word.

arysin commented 8 years ago

The feature does not ignore the word; it ignores the characters within the word, so words with accented characters are treated as if they had none. If you only implement accent-ignoring in the tagger, the disambiguator and the rules won't benefit from it when matching tokens. If you use getIgnoredCharactersRegex(), all parts of the code will treat accented words like unaccented ones, e.g. a rule that contains Большую will also catch Бóльшую.

arysin commented 8 years ago

See https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/rules/uk/grammar-spelling.xml#L123 Both пів-Європи and пів-Євро́пи are caught by the POS tag in the rule, even though the Ukrainian dictionary does not contain accented words.

kostyfisik commented 8 years ago

@arysin Good! This can be a simple solution. Does it cover all accents ó á и́ é ý ?
Suppose we add to the POS dictionary that Бóльшую is only ADJ_Com:Fem:V (from the lemma больший), that Большýю is only ADJ:Fem:V (from the lemma большой), and that Большую can have both POS tags (since without a stress mark it is ambiguous). How will getIgnoredCharactersRegex() behave then? For a word that has accented entries in the dictionary, will it keep only the tag of the accented form, or will it provide both tags in all three cases? As a professional tool, LT should be able to handle accents when they are present, not just ignore them. For Russian, this will become possible as soon as an accented dictionary is added.

arysin commented 8 years ago

For Ukrainian we just remove all accent characters, but it's a regex, so you can write it in such a way that it removes accents only after vowels.

BTW, in your example above only и is properly accented (using U+0301); the other 4 vowels are incorrectly represented by accented Latin letters. If you want to support these Latin letters as stress marks, you'd need a different solution (but beware that you would not be able to find such words in rules by their text). For Ukrainian we actually flag a mixture of Latin and Cyrillic characters within a word as a separate error (and don't try to tag such words at all).
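The difference between the two ways of writing a stressed vowel can be checked with the JDK's script classification. The sketch below is a hypothetical helper (`ScriptCheck` is not part of LanguageTool and is not how MixedAlphabetsRule is implemented); it assumes that "mixed" simply means at least one Latin and one Cyrillic code point in the same word.

```java
final class ScriptCheck {
  // Returns true if the word contains at least one Latin and
  // at least one Cyrillic code point.
  static boolean mixesLatinAndCyrillic(String word) {
    boolean latin = false;
    boolean cyrillic = false;
    for (int i = 0; i < word.length(); ) {
      int cp = word.codePointAt(i);
      Character.UnicodeScript script = Character.UnicodeScript.of(cp);
      if (script == Character.UnicodeScript.LATIN) {
        latin = true;
      } else if (script == Character.UnicodeScript.CYRILLIC) {
        cyrillic = true;
      }
      // Combining marks such as U+0301 belong to neither script
      // and therefore do not affect the result.
      i += Character.charCount(cp);
    }
    return latin && cyrillic;
  }
}
```

Under this check, a word spelled with the precomposed Latin ó (U+00F3) counts as mixed, while the same word spelled with Cyrillic о followed by U+0301 does not.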

Currently getIgnoredCharactersRegex() does not support first trying the word with the ignored symbols in the dictionary (this can be changed, of course). In reality, though, even if you add accented words to the dictionary, most Cyrillic texts out there don't have accents, so you can't rely on accents to get the tagging right.

kostyfisik commented 8 years ago

It looks like you are right. Adding your regexp seems to be the best solution at the moment (at least until someone tries to check an accented Russian corpus with LT). The regex can be extended with U+0300, which seems to be legitimate in compound words like о̀колозе́мный that carry two accents (a main one and a secondary one).

Could you please give a link to your Latin/Cyrillic rule? I tried this recently (see the two rules after https://github.com/languagetool-org/languagetool/blob/73ac458b70e745761d25d7469daf0405a2d53b10/languagetool-language-modules/ru/src/main/resources/org/languagetool/rules/ru/grammar.xml#L6894 ); it still gives some false positives for single-letter words.

kostyfisik commented 8 years ago

I also had to extend multi-letter rule with letters from other languages which use Cyrillic letters (like Kazakh or old Slovenian).

arysin commented 8 years ago

Detecting mixed alphabets was easy enough for an XML rule, but recommending the correction is quite hard without code, so I wrote the whole rule in Java: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/MixedAlphabetsRule.java

kostyfisik commented 8 years ago

Cool! While it looks too hard for me, I believe that my implementation gives fewer false positives when, e.g., a Kazakh word mixes letters from the Russian and Kazakh alphabets (I checked against over 10 million sentences from ru wiki to find false positives).

Your implementation of Roman numerals is incomplete: use IVXLCDM as the list of commonly used symbols (mostly found in the establishment date of a copyright).

kostyfisik commented 8 years ago

I tried to update with the ignore rule https://github.com/languagetool-org/languagetool/pull/528/files; however, it only works halfway: it does not cause errors... But if I try to add

        <example correction=""><marker>Желе́зные кровать</marker>.</example>

to the rule, checkrules returns an error:

Running pattern rule tests for Russian...
Exception in thread "main" java.lang.AssertionError: Russian: Incorrect match position markup (end) for rule Unify_Adj_NN_number[1], sentence: Желе́зные кровать. expected:<17> but was:<16>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:834)
    at org.junit.Assert.assertEquals(Assert.java:645)
    at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:338)
    at org.languagetool.rules.patterns.PatternRuleTest.testGrammarRulesFromXML(PatternRuleTest.java:273)
    at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:198)
    at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:149)
    at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:558)
Running disambiguator rule tests...

The neighboring

                <example correction=""><marker>Железные кровать</marker>.</example>

works fine.
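The two numbers in the assertion coincide with the length of the marked text with and without the combining accent. The thread does not confirm the cause, so the sketch below only demonstrates that arithmetic; `MarkerOffsets` is a hypothetical name, and the ignored-character class is assumed to be the one from the Ukrainian snippet.

```java
final class MarkerOffsets {
  // The marked text from the failing example, with the stress written
  // as a combining acute accent (U+0301) after the second "е".
  static final String ACCENTED = "Желе\u0301зные кровать";

  // Assumption: the same ignored characters as in the Ukrainian module.
  static String stripIgnored(String text) {
    return text.replaceAll("[\u00AD\u0301]", "");
  }

  public static void main(String[] args) {
    // Length of the text as typed, including the combining accent.
    System.out.println(ACCENTED.length());               // 17
    // Length after the ignored characters have been removed.
    System.out.println(stripIgnored(ACCENTED).length()); // 16
  }
}
```

If the match end position is computed on the stripped text while the expected position comes from the original markup, the two would differ by exactly one for every ignored character inside the marker.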

yakovru commented 2 years ago

Fixed for Russian. Closed.

kostyfisik commented 2 years ago

@yakovru @arysin Thank you for your help and support in this!