tsproisl / SoMeWeTa

A part-of-speech tagger with support for domain adaptation and external resources.
GNU General Public License v3.0

inaccurate action word recognition #6

Open vizzerdrix55 opened 4 years ago

vizzerdrix55 commented 4 years ago

SoMeWeTa uses the tagset STTS_IBK for tagging. One of the differences between STTS and STTS_IBK is the tag for action words (AKW), e.g. German lach (Beißwenger, Bartz, Storrer and Westpfahl, 2015). I tested the accuracy of AKW tagging with a small sample of tokens. As you can see from the attached results, the accuracy is only about 33 %.

You can reproduce the wrong tagging with the following minimal working example containing 10 sample sentences:

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update path to the language model
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS-Tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

# test sentences from authentic German CMC data
sentences = ["Also das schlägt ja wohl dem Fass den Boden aus! :haeh:",
             "das mehr oder weniger gute Dlc gabs noch gratis dazu.",
             "Aus der Liste: definitiv Brink, obwohls für kurze Zeit Spaß gemacht "
             "hat, aber im Nachhinein hab ichs doch sehr bereut.",
             "*schluchz, heul*",
             "endlich, und dann noch als standalone-addon *freu*",
             "Und immer schön mit den Holländer zocken, da gabs die besten Preise.",
             "Ich freu mich riesig und weiß was ich im Wintersemester "
             "jeden Tag machen werde!!",
             "alles oben in der liste gabs unter bf2 auch schon in einer form.",
             "Mit dem Account werden weitere Features im Online-Modus des FM11 "
             "freigeschaltet, bswp mehr Statistiken, mehr Aktionskarten, mögliche "
             "Fantasy-Liga, yadda, yadda."]

akws = []
for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    for word in tagged_sentence:
        # collect all (token, tag) pairs tagged with PoS tag 'AKW'
        if "AKW" in word:
            akws.append(word)
print("tagged as AKW:", akws)

The output list akws contains only two correct action words ('heul' and 'freu'). 'Haeh' is part of an emoticon, 'gabs' and 'obwohls' are in fact contractions, and 'bswp' is an abbreviation of German 'beispielsweise'.

Is this serious enough to be considered an issue, or have I implemented something wrong? As far as I can see, this error type is not part of the error table (Table 4) in Proisl (2018, p. 668).

Cited sources:

tsproisl commented 4 years ago

Of course, ten sentences containing three instances (with two of them belonging together) are not enough to get a robust estimate of the tagging accuracy. Nevertheless, let's take a closer look at the data: Two out of three AKWs are recognized as such (i.e. recall is ⅔); however, four non-AKWs are erroneously tagged as AKW as well (i.e. precision is ⅓). A closer analysis suggests that the main problem is data sparsity.
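For concreteness, the two figures can be recomputed from the counts in the example (a quick sketch; the counts come from the ten sentences above):

```python
# Counts from the ten example sentences:
# 2 true AKWs tagged as AKW, 1 true AKW missed, 4 non-AKWs tagged as AKW.
true_positives = 2   # "heul", "freu"
false_negatives = 1  # "schluchz"
false_positives = 4  # "haeh", "gabs", "obwohls", "bswp"

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
print(f"recall:    {recall:.2f}")    # 2/3 ≈ 0.67
print(f"precision: {precision:.2f}")  # 2/6 ≈ 0.33
```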

Of the seven word forms involved, only one is known to the tagger (freu); the others are unknown, i.e. they do not occur in the training data (haeh, schluchz, heul, gabs, obwohls, bswp). Furthermore, a whole phenomenon (:haeh:) is unknown to the tagger since the training data do not contain textual representations of emoticons in this format. Another phenomenon occurs only once (*schluchz, heul*): The only instance of comma-separated action words in the training data is *rupf, zerr, reiss, mich losmach*.

How could we improve performance? Ideally by providing the tagger with more training data. A quicker solution might be a custom post-processor. If you are reasonably sure that a token between colons is always a textual representation of an emoticon and that a token between asterisks is always an action word in your data, you could assign the corresponding tags in a post-processing step. (Ideally that should be a pre-processing step, enabling the tagger to incorporate that information into the further analysis. Unfortunately, SoMeWeTa cannot tag partially annotated input at the moment – although it can be trained and evaluated on partially annotated data.) A sample post-processor for STTS_IBK is available in utils/STTS_IBK_postprocessor.
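As a rough illustration of such a post-processing step (this is my own sketch, not the actual utils/STTS_IBK_postprocessor script; the regex patterns and the choice of the STTS_IBK emoticon tag EMOASC are assumptions), one could retag the tagger's output like this:

```python
import re

# Hypothetical post-processor: retag tokens wrapped in colons as emoticons
# and material between asterisks as action words. The patterns below are
# assumptions about the data, not rules taken from SoMeWeTa itself.
COLON_EMOTICON = re.compile(r"^:\w+:$")  # e.g. ":haeh:"

def postprocess(tagged_sentence):
    """Take a list of (token, tag) pairs and return a corrected copy."""
    corrected = []
    inside_action = False
    for token, tag in tagged_sentence:
        if COLON_EMOTICON.match(token):
            tag = "EMOASC"
        elif token == "*":
            # asterisks toggle an "inside action word(s)" state,
            # so comma-separated cases like *schluchz, heul* also work
            inside_action = not inside_action
        elif inside_action and tag not in ("$,", "$."):
            tag = "AKW"
        corrected.append((token, tag))
    return corrected

# Example with a tokenized version of "*schluchz, heul*":
print(postprocess([("*", "XY"), ("schluchz", "VVFIN"), (",", "$,"),
                   ("heul", "AKW"), ("*", "XY")]))
# → [('*', 'XY'), ('schluchz', 'AKW'), (',', '$,'), ('heul', 'AKW'), ('*', 'XY')]
```

Whether the asterisks arrive as separate tokens depends on the tokenizer settings, so the toggle logic would need to be adapted to the actual token stream.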

In a future version of SoMeWeTa, phenomena like the ones in that post-processor script (i.e. phenomena that can be deterministically recognized with high accuracy) might be dealt with by a model-specific pre-processor that is incorporated into the tagger model.