tsproisl / SoMeWeTa

A part-of-speech tagger with support for domain adaptation and external resources.
GNU General Public License v3.0

Mistagging of homographic, sentence-initial verbs #7

Open vizzerdrix55 opened 4 years ago

vizzerdrix55 commented 4 years ago

As mentioned by Horbach et al. (2015, p. 44), sentence-initial verbs are frequent in CMC data and are often mistagged as nouns by standard tools. I checked the behavior of SoMeWeTa with the german_web_social_media_2018-12-21.model and found that it does a really good job of recognizing these kinds of verbs. The example provided in Horbach et al. (2015, p. 44) is in fact a tricky one:

Blicke da auch nicht so richtig durch und habe Probleme damit

Blicke is homographic with the German plural of 'der Blick', but in the example it is meant as the first person singular of the German verb 'blicken'. In this case, SoMeWeTa also mistags it as a noun. This seems to be true for some of the rare cases of homographic sentence-initial verbs. (To be precise: they have to be homographic with a token of another part-of-speech subcategory.)

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS tagger)
# TODO: update the path to your model here
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = {"post"}

# Tokenize a paragraph, split it into sentences and PoS-tag each sentence
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    return [asptagger.tag_xml_sentence(sentence) for sentence in sentences]

# Test sentences: introspectively generated German sentences
sentences = ["Blicke da auch nicht durch.",
             "Check ich auch nicht.",
             "Schau mir das morgen an.",
             "Trank kurz den Tee fertig."]

for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    print(tagged_sentence)

If you run the above code, the output will be:

The homographs in my examples are the following nouns: 'der Check', 'die Schau' and 'der Trank'. As you can see from the example above, only the example sentence from Horbach et al. seems to be affected; all other test sentences have been tagged correctly. I have not yet found a pattern behind the failure. As this is not among the documented errors of SoMeWeTa (Proisl, 2018, p. 667), I consider it an issue.
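To look for further cases more systematically, one could flag every sentence whose first token received a noun tag and then review those candidates by hand. The following is only a minimal sketch: `first_token_tagged_as_noun` and `suspicious_sentences` are hypothetical helpers, not part of SoMeWeTa. It assumes that each tagged sentence can be reduced to a list of (token, tag) pairs with STTS tags. The input below is hand-written placeholder data, not actual tagger output.

```python
# Hypothetical helpers (not part of SoMeWeTa): flag sentences whose first
# token carries a noun tag. Assumption: a tagged sentence is a list of
# (token, tag) pairs using STTS tags.
NOUN_TAGS = {"NN", "NE"}  # STTS: common noun, proper noun


def first_token_tagged_as_noun(tagged_sentence):
    """Return True if the first (token, tag) pair carries a noun tag."""
    if not tagged_sentence:
        return False
    tag = tagged_sentence[0][1]
    return tag in NOUN_TAGS


def suspicious_sentences(tagged_sentences):
    """Keep only the sentences that start with a noun-tagged token."""
    return [s for s in tagged_sentences if first_token_tagged_as_noun(s)]


# Hand-written placeholder input, NOT real SoMeWeTa output
example = [
    [("Blicke", "NN"), ("da", "ADV")],
    [("Trank", "VVFIN"), ("kurz", "ADV")],
]

flagged = suspicious_sentences(example)
for sentence in flagged:
    print(" ".join(pair[0] for pair in sentence))
```

Sentences that really do begin with a noun are flagged as well, so the result is a candidate list for manual inspection rather than a list of confirmed tagging errors.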

Sources: