markuskiller / textblob-de

German language support for TextBlob.
https://textblob-de.readthedocs.org
MIT License

PatternParserLemmatizer: tagging errors negatively affecting sentiment analysis #6

Open markuskiller opened 10 years ago

markuskiller commented 10 years ago

Tagging errors in the PatternParser output can lead to incorrect lemmatization of frequent German adjectives. As a result, every tool that relies on the parser's output (POS tagging, sentiment analysis, noun phrase extraction, etc.) may return unexpected results:

Example (using IPython):


In [1]: from textblob_de import TextBlobDE
In [2]: TextBlobDE(u"Peter hat einen schönen Hund.").sentiment
Out[2]: Sentiment(polarity=0.0, subjectivity=0.0)
Out[EXPECTED]: Sentiment(polarity=1.0, subjectivity=0.0)

In [3]: TextBlobDE(u"Peter hat einen schönen Hund.").noun_phrases
Out[3]: WordList([])
Out[EXPECTED]: WordList([u'schönen Hund'])

In [4]: TextBlobDE(u"Peter hat einen schönen Hund.").tags
Out[4]: [('Peter', 'NNP'), ('hat', 'VB'), ('einen', 'DT'),  (u'schönen', 'PRP$'),  ('Hund', 'NN')]
Out[EXPECTED]: [...,  (u'schönen', 'JJ'), ...]
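
To pinpoint mismatches like the one above without rereading the whole tag list, a small helper can compare observed tags against the tags one expects. This is only a sketch: `tag_mismatches` is a hypothetical helper (not part of textblob-de or pattern), and the data is copied from Out[4] and its expected value above.

```python
def tag_mismatches(observed, expected):
    """Return (word, observed_tag, expected_tag) for each word whose tag differs."""
    return [
        (word, tag, expected[word])
        for word, tag in observed
        if word in expected and tag != expected[word]
    ]


# Tags reported in Out[4] above.
observed = [("Peter", "NNP"), ("hat", "VB"), ("einen", "DT"),
            ("schönen", "PRP$"), ("Hund", "NN")]
# Only the tag for which this issue states an expectation.
expected = {"schönen": "JJ"}

for word, got, want in tag_mismatches(observed, expected):
    print(f"{word}: got {got}, expected {want}")
```

In real use, `observed` would come from `TextBlobDE(...).tags` instead of a hard-coded list.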

Root cause:


In [5]: from pattern.de import parse, pprint

In [6]: pprint(parse(u"Peter hat einen schönen Hund.", lemmata=True))

          WORD   TAG      CHUNK   ROLE   ID     PNP    LEMMA

         Peter   NNP      NP      -      -      -      peter
           hat   VB       VP      -      -      -      haben
         einen   DT       NP      -      -      -      ein
       schönen > PRP$ <   NP ^    -      -      -    > schön[en] <
          Hund   NN       NP ^    -      -      -      hund
             .   .        -       -      -      -      .
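
Why a single wrong tag can zero out the sentiment score can be illustrated with a toy model. Everything in it is an assumption made for illustration: the one-entry lexicon, the ending list, and the functions `toy_lemma` and `toy_polarity` are hypothetical and do not reproduce the actual pattern.de or textblob-de code. The model assumes that adjective endings are stripped only for words tagged as adjectives and that polarity is looked up by lemma.

```python
# Toy model only: hypothetical data and rules, not pattern.de's implementation.
POLARITY = {"schön": 1.0}                     # hypothetical lexicon keyed by lemma
ADJ_ENDINGS = ("en", "em", "er", "es", "e")   # simplified adjective endings


def toy_lemma(word, tag):
    """Lowercase the word; strip an adjective ending only if tagged JJ*."""
    word = word.lower()
    if tag.startswith("JJ"):
        for ending in ADJ_ENDINGS:
            if word.endswith(ending) and len(word) > len(ending) + 2:
                return word[: -len(ending)]
    return word


def toy_polarity(tagged):
    """Average the polarity of all lemmas found in the lexicon (0.0 if none)."""
    lemmas = (toy_lemma(word, tag) for word, tag in tagged)
    scores = [POLARITY[lemma] for lemma in lemmas if lemma in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


wrong = [("Peter", "NNP"), ("hat", "VB"), ("einen", "DT"),
         ("schönen", "PRP$"), ("Hund", "NN")]
fixed = [(w, "JJ" if w == "schönen" else t) for w, t in wrong]

print(toy_polarity(wrong))  # lemma stays "schönen", so the lookup misses
print(toy_polarity(fixed))  # lemma becomes "schön", so the lookup hits
```

Under these assumptions the mis-tagged sentence scores 0.0 and the correctly tagged one scores 1.0, which mirrors the observed and expected values reported above.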

Please send suggestions for improvement directly to the pattern project (see e.g. https://github.com/clips/pattern/issues/63). The version of pattern.text.de included in textblob-de will be updated regularly.

I am also working on integrating additional lemmatizers into textblob_de, but PatternParserLemmatizer will remain the default, as it is implemented in Python.