Mistagging of homographic, sentence-initial verbs

As mentioned by Horbach et al. (2015, p. 44), sentence-initial verbs are frequent in CMC data and are often mistagged as nouns by standard tools. I checked the behavior of SoMeWeTa with the german_web_social_media_2018-12-21.model and found that it does a very good job of recognizing this kind of verb. The example provided by Horbach et al. (2015, p. 44), however, is indeed a tricky one:

Blicke da auch nicht so richtig durch und habe Probleme damit

'Blicke' is homographic with the plural of the German noun 'der Blick', but in the example it is the first person singular of the German verb 'blicken'. In this case, SoMeWeTa also mistags it as a noun. This seems to hold for some of the rare cases of homographic sentence-initial verbs. (To be precise: the verb form has to be homographic with a token belonging to a different part-of-speech subcategory.)

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS tagger)
# TODO: update the path to your model here
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = {"post"}

# Tokenize the text, split it into sentences and PoS-tag each sentence
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

# Test sentences: German sentences constructed by introspection
sentences = ["Blicke da auch nicht durch.",
             "Check ich auch nicht.",
             "Schau mir das morgen an.",
             "Trank kurz den Tee fertig."]

for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    print(tagged_sentence)

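If you run the above code, the output is as follows (the "correct for:" / "incorrect for:" labels are my own annotations, not part of the printed output):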

incorrect for: [('Blicke', 'NN'), ('da', 'ADV'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('durch', 'PTKVZ'), ('.', '$.')]
correct for: [('Check', 'VVFIN'), ('ich', 'PPER'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('.', '$.')]
correct for: [('Schau', 'VVIMP'), ('mir', 'PPER'), ('das', 'ART'), ('morgen', 'NN'), ('an', 'PTKVZ'), ('.', '$.')]
correct for: [('Trank', 'VVFIN'), ('kurz', 'ADJD'), ('den', 'ART'), ('Tee', 'NN'), ('fertig', 'ADJD'), ('.', '$.')]

The homographs in my examples are the following nouns: 'der Check', 'die Schau' and 'der Trank'. As the output above shows, only the example sentence from Horbach et al. seems to be affected; all other test sentences were tagged correctly. I have not yet found a pattern behind the failure. As this is not among the documented errors of SoMeWeTa (Proisl, 2018, p. 667), I am reporting it as an issue.
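To look for a pattern, it might help to scan a larger set of tagged sentences for sentence-initial tokens that received a noun tag. The following is only a minimal sketch: the helper name `initial_noun_candidates` and the constant `NOUN_TAGS` are hypothetical and not part of SoMeWeTa, and the input is simply the tagger output from the four test sentences above, copied in by hand.

```python
# Hypothetical helper, not part of SoMeWeTa: it only inspects
# lists of (token, tag) tuples such as those printed above.

# STTS tags for common nouns and proper nouns.
NOUN_TAGS = {"NN", "NE"}


def initial_noun_candidates(tagged_sentences):
    """Return (sentence index, token, tag) for every sentence
    whose first token carries a noun tag."""
    hits = []
    for index, sentence in enumerate(tagged_sentences):
        if not sentence:
            continue  # skip empty sentences
        token, tag = sentence[0]
        if tag in NOUN_TAGS:
            hits.append((index, token, tag))
    return hits


# Tagger output copied from the four test sentences above.
tagged = [
    [("Blicke", "NN"), ("da", "ADV"), ("auch", "ADV"),
     ("nicht", "PTKNEG"), ("durch", "PTKVZ"), (".", "$.")],
    [("Check", "VVFIN"), ("ich", "PPER"), ("auch", "ADV"),
     ("nicht", "PTKNEG"), (".", "$.")],
    [("Schau", "VVIMP"), ("mir", "PPER"), ("das", "ART"),
     ("morgen", "NN"), ("an", "PTKVZ"), (".", "$.")],
    [("Trank", "VVFIN"), ("kurz", "ADJD"), ("den", "ART"),
     ("Tee", "NN"), ("fertig", "ADJD"), (".", "$.")],
]

for index, token, tag in initial_noun_candidates(tagged):
    print(index, token, tag)  # prints: 0 Blicke NN
```

Note that this only produces candidates: a sentence that legitimately starts with a noun would be flagged as well, so the hits still need manual review.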

Sources:

Horbach, Andrea / Thater, Stefan / Steffen, Diana / Fischer, Peter M. / Witt, Andreas and Pinkal, Manfred (2015). Internet Corpora: A Challenge for Linguistic Processing. In: Datenbank-Spektrum, 15(1), 41–47. https://doi.org/10.1007/s13222-014-0172-z
Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In: European Language Resources Association (ELRA) (ed.), Proceedings of the 11th Language Resources and Evaluation Conference (pp. 665–670). Miyazaki, Japan: European Language Resources Association. Retrieved from https://www.aclweb.org/anthology/L18-1106
