markuskiller / textblob-de

German language support for TextBlob.
https://textblob-de.readthedocs.org
MIT License

Don't have to load file every time #11

Status: Closed (Arttii closed this 9 years ago)

Arttii commented 9 years ago

With this small change we can share the analyzer across blob instances, so we do not load the lexicon file every time. This seems to work fine. Is there any reason why we are not sharing the sentiment instance globally? (Textblob-en does so.)

tb = BlobberDE(analyzer=PatternAnalyzer(tokenizer=NLTKPunktTokenizer()))
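The sharing idea can be sketched without textblob-de itself. This is only an illustration under stated assumptions: `LexiconAnalyzer` and `Blob` are hypothetical stand-ins, not the library's classes, and the tiny in-memory lexicon replaces the real file that gets loaded.

```python
class LexiconAnalyzer:
    """Hypothetical stand-in for an analyzer that loads a lexicon on creation."""

    loads = 0  # counts how often the (pretend) expensive load ran

    def __init__(self):
        LexiconAnalyzer.loads += 1  # stands in for parsing a large lexicon file
        self.lexicon = {"schön": 1.0, "schlecht": -1.0}

    def analyze(self, text):
        # Average the scores of all known words; 0.0 if none are known.
        scores = [self.lexicon[w] for w in text.lower().split() if w in self.lexicon]
        return sum(scores) / len(scores) if scores else 0.0


class Blob:
    """Hypothetical stand-in for a blob that delegates to an analyzer."""

    def __init__(self, text, analyzer=None):
        self.text = text
        # Without a shared analyzer, every blob pays the loading cost again.
        self.analyzer = analyzer if analyzer is not None else LexiconAnalyzer()

    @property
    def polarity(self):
        return self.analyzer.analyze(self.text)


shared = LexiconAnalyzer()  # loaded once
blobs = [Blob(t, analyzer=shared) for t in ("Das ist schön", "Das ist schlecht")]
polarities = [b.polarity for b in blobs]
```

Here `LexiconAnalyzer.loads` stays at 1 no matter how many blobs are created. The trade-off is that any mutable state inside the shared analyzer is now visible to every blob.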

markuskiller commented 9 years ago

Fair point! Thanks for the PR. Unfortunately, this breaks sentiment analysis with PatternTokenizer(). I haven't yet found a solution that doesn't break the tests, but I'll keep looking. The first call to PatternAnalyzer(tokenizer=PatternTokenizer()).analyze(text) works fine with your solution, but subsequent calls raise an error. This does not happen if sentiment (or s) is only called locally.

Moving sentiment to ext/_pattern/text/de/__init__.py (this is where it is located in the main textblob library) would break compatibility with the original pattern library on Python 2.

Arttii commented 9 years ago

Cool, I didn't look at the tests, to be honest. I'll look into it as well and might find a solution.



markuskiller commented 9 years ago

Managed to narrow the problem down to a Python 2/Python 3 difference in the map function: on Python 2 it returns a list, whereas on Python 3 it returns an iterator that can only be consumed once.

It works fine on Python 2, but on Python 3 my implementation of the PatternAnalyzer chokes on repeated lookups of identical (word, tag) pairs. Reloading the dictionary from the XML file for every sentence had concealed this bug:

1st lookup of ("schön", None) on Python3:


#ext/_pattern/text/__init__.py [line 2018]
p, s, i = self["schön"][None] 

# content of `self` dictionary in `Sentiment` class
{"schön": {"JJ": <map object with polarity values  [1.0, 0.0, 1.0]>, 
                 None: <map object with polarity values  [1.0, 0.0, 1.0]>, ...}, ...}

2nd lookup of ("schön", None) on Python3:


#ext/_pattern/text/__init__.py [line 2018]
p, s, i = self["schön"][None] 

# content of `self` dictionary in `Sentiment` class after first lookup
{"schön": {"JJ": <map object with polarity values  [1.0, 0.0, 1.0]>, 
                 None: <map object with ***consumed*** polarity values  []>, ...}, ...}

Resulting in the following error message:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-2a2bbf801819> in <module>()
----> 1 b.sentiment

/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
     22         if obj is None:
     23             return self
---> 24         value = obj.__dict__[self.func.__name__] = self.func(obj)
     25         return value
     26 

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in sentiment(self)
    668         _subjectivity = 0
    669         for s in self.sentences:
--> 670             _polarity += s.polarity
    671             _subjectivity += s.subjectivity
    672         try:

/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
     22         if obj is None:
     23             return self
---> 24         value = obj.__dict__[self.func.__name__] = self.func(obj)
     25         return value
     26 

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in polarity(self)
    430         :rtype: float
    431         """
--> 432         return self.sentiment[0]
    433 
    434     @cached_property

/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
     22         if obj is None:
     23             return self
---> 24         value = obj.__dict__[self.func.__name__] = self.func(obj)
     25         return value
     26 

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in sentiment(self)
    422         :rtype: namedtuple of the form ``Sentiment(polarity, subjectivity)``
    423         """
--> 424         return self.analyzer.analyze(self.raw)
    425 
    426     @cached_property

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/sentiments.py in analyze(self, text)
    140         if self.lemmatize:
    141             text = self._lemmatize(text)
--> 142         return self.RETURN_TYPE(*pattern_sentiment(text))
    143 
    144     def _lemmatize(self, raw):

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/sentiments.py in pattern_sentiment(text)
     97         language = "de"
     98     )
---> 99     return s(text)
    100 
    101 #################### END SENTIMENT DETECTION ##################################

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/ext/_pattern/text/__init__.py in __call__(self, s, negation, **kwargs)
   1970         # Sentiment("a horrible movie") => (-0.6, 1.0)
   1971         elif isinstance(s, basestring):
-> 1972             a = self.assessments(((w.lower(), None) for w in " ".join(self.tokenizer(s)).split()), negation)
   1973         # A pattern.en.Text.
   1974         elif hasattr(s, "sentences"):

/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/ext/_pattern/text/__init__.py in assessments(self, words, negation)
   2016                 continue
   2017             if w in self and pos in self[w]:
-> 2018                 p, s, i = self[w][pos]
   2019                 # Known word not preceded by a modifier ("good").
   2020                 if m is None:

ValueError: need more than 0 values to unpack
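
The failure can be reproduced on Python 3 without textblob-de. This is a minimal sketch: the `avg` helper, the `lexicon` dictionary and the score triples are made up for the illustration and only mimic the shape of the data in the traceback above.

```python
def avg(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Illustrative score triples for two senses of one word (assumed values).
psi = [(1.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

# On Python 3, map() returns a lazy one-shot iterator, not a list.
lexicon = {"schön": {None: map(avg, zip(*psi))}}

# The first lookup unpacks the three averages and drains the iterator.
p, s, i = lexicon["schön"][None]

# The second lookup finds the same, now empty, iterator.
error = None
try:
    p2, s2, i2 = lexicon["schön"][None]
except ValueError as exc:
    error = exc
```

After this runs, `error` holds a `ValueError` whose exact wording depends on the Python version. On Python 2 the same code would store a list, and both lookups would succeed.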

markuskiller commented 9 years ago

One possible solution would be to convert the map object in the vendorised pattern implementation into a list, which can be read any number of times (this works on both Python 2 and Python 3):


# ext/_pattern/text/__init__.py [lines 1911-1922]

        # Average scores of all word senses per part-of-speech tag.
        for w in words:
            words[w] = dict((pos, list(map(avg, zip(*psi)))) for pos, psi in words[w].items())
        # Average scores of all part-of-speech tags.
        for w, pos in words.items():
            words[w][None] = list(map(avg, zip(*pos.values())))
        # Average scores of all synonyms per synset.
        for id, psi in synsets.items():
            synsets[id] = list(map(avg, zip(*psi)))
        dict.update(self, words)
        dict.update(self.labeler, labels)
        dict.update(self._synsets, synsets)

The second option would be to use the solution applied in the main textblob library:


# textblob/_text.py [lines 765-776]

        # Average scores of all word senses per part-of-speech tag.
        for w in words:
            words[w] = dict((pos, [avg(each) for each in zip(*psi)]) for pos, psi in words[w].items())
        # Average scores of all part-of-speech tags.
        for w, pos in list(words.items()):
            words[w][None] = [avg(each) for each in zip(*pos.values())]
        # Average scores of all synonyms per synset.
        for id, psi in synsets.items():
            synsets[id] = [avg(each) for each in zip(*psi)]
        dict.update(self, words)
        dict.update(self.labeler, labels)
        dict.update(self._synsets, synsets)
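
Both options store a plain list, so they should behave the same on repeated lookups. A small sketch with an assumed `avg` helper and made-up score triples:

```python
def avg(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Illustrative score triples for two senses of one word (assumed values).
psi = [(0.5, 0.25, 1.0), (1.0, 0.75, 1.0)]

# Option 1: wrap map() in list() so the result is materialised once.
option_1 = list(map(avg, zip(*psi)))

# Option 2: build the same list with a list comprehension.
option_2 = [avg(each) for each in zip(*psi)]

# A list can be unpacked any number of times.
first = tuple(option_2)
second = tuple(option_2)
```

Both `option_1` and `option_2` are ordinary lists with equal contents, so the choice between them is a matter of readability and of consistency with the surrounding code base.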

My approach would be to stay as close to the original pattern implementation as possible (i.e. the first option).

@Arttii Any thoughts on that?

markuskiller commented 9 years ago

Went for the 2nd option, as it is easier to read and is consistent with the main textblob library.

Arttii commented 9 years ago

Hi,

Sorry, I was a bit busy with work. Yes, the second one seems more sensible. It's a bit confusing for people coming from the main library and looking at the sources (it was for me at least). Also, it seems more "pythonic", to be honest.

Cheers for the quick fix.

Edit: I tested this with my setup, and it works perfectly.