Fair point! Thanks for the PR. Unfortunately, this breaks sentiment analysis with `PatternTokenizer()`. I'll try to find a solution, but I haven't found one that doesn't break the tests just yet. The first call to `PatternAnalyzer(tokenizer=PatternTokenizer()).analyze(text)` works fine with your solution, but subsequent calls raise an error. This does not happen if `sentiment` (or `s`) is called locally only.

Moving `sentiment` to `ext/_pattern/text/de/__init__.py` (which is where it is located in the main `textblob` library) would break compatibility with the original `pattern` library on Python 2.
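For reference, a minimal reproduction of the failure mode described above (a sketch: the import paths are assumed from the textblob-de package layout, and the sample text is illustrative):

```python
from textblob_de.sentiments import PatternAnalyzer
from textblob_de.tokenizers import PatternTokenizer

analyzer = PatternAnalyzer(tokenizer=PatternTokenizer())

print(analyzer.analyze("Heute ist ein schöner Tag."))  # 1st call: works
print(analyzer.analyze("Heute ist ein schöner Tag."))  # 2nd call: raises ValueError with the PR applied
```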
Cool, I didn't look at the tests, to be honest. I'll look into it as well; I might find a solution.
Managed to narrow the problem down to a Py2/Py3 difference in the `map` function: on Python 2 `map` returns a list, but on Python 3 it returns a one-shot iterator that is exhausted after a single full iteration. My implementation of the `PatternAnalyzer` therefore works fine on Python 2, but on Python 3 it chokes on repeated lookups of identical `(word, tag)` pairs (reloading the dictionary from the XML file for every sentence had concealed this bug):
1st lookup of `("schön", None)` on Python 3:
# ext/_pattern/text/__init__.py [line 2018]
p, s, i = self["schön"][None]
# content of `self` dictionary in `Sentiment` class
{"schön": {"JJ": <map object with polarity values [1.0, 0.0, 1.0]>,
None: <map object with polarity values [1.0, 0.0, 1.0]>, ...}, ...}
2nd lookup of `("schön", None)` on Python 3:
# ext/_pattern/text/__init__.py [line 2018]
p, s, i = self["schön"][None]
# content of `self` dictionary in `Sentiment` class after first lookup
{"schön": {"JJ": <map object with polarity values [1.0, 0.0, 1.0]>,
None: <map object with ***consumed*** polarity values []>, ...}, ...}
Resulting in the following error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-2a2bbf801819> in <module>()
----> 1 b.sentiment
/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
22 if obj is None:
23 return self
---> 24 value = obj.__dict__[self.func.__name__] = self.func(obj)
25 return value
26
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in sentiment(self)
668 _subjectivity = 0
669 for s in self.sentences:
--> 670 _polarity += s.polarity
671 _subjectivity += s.subjectivity
672 try:
/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
22 if obj is None:
23 return self
---> 24 value = obj.__dict__[self.func.__name__] = self.func(obj)
25 return value
26
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in polarity(self)
430 :rtype: float
431 """
--> 432 return self.sentiment[0]
433
434 @cached_property
/home/mki/venv/t5/lib/python3.4/site-packages/textblob/decorators.py in __get__(self, obj, cls)
22 if obj is None:
23 return self
---> 24 value = obj.__dict__[self.func.__name__] = self.func(obj)
25 return value
26
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/blob.py in sentiment(self)
422 :rtype: namedtuple of the form ``Sentiment(polarity, subjectivity)``
423 """
--> 424 return self.analyzer.analyze(self.raw)
425
426 @cached_property
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/sentiments.py in analyze(self, text)
140 if self.lemmatize:
141 text = self._lemmatize(text)
--> 142 return self.RETURN_TYPE(*pattern_sentiment(text))
143
144 def _lemmatize(self, raw):
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/sentiments.py in pattern_sentiment(text)
97 language = "de"
98 )
---> 99 return s(text)
100
101 #################### END SENTIMENT DETECTION ##################################
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/ext/_pattern/text/__init__.py in __call__(self, s, negation, **kwargs)
1970 # Sentiment("a horrible movie") => (-0.6, 1.0)
1971 elif isinstance(s, basestring):
-> 1972 a = self.assessments(((w.lower(), None) for w in " ".join(self.tokenizer(s)).split()), negation)
1973 # A pattern.en.Text.
1974 elif hasattr(s, "sentences"):
/home/mki/venv/t5/lib/python3.4/site-packages/textblob_de/ext/_pattern/text/__init__.py in assessments(self, words, negation)
2016 continue
2017 if w in self and pos in self[w]:
-> 2018 p, s, i = self[w][pos]
2019 # Known word not preceded by a modifier ("good").
2020 if m is None:
ValueError: need more than 0 values to unpack
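The underlying Python 3 behaviour can be reproduced in isolation. A minimal standalone sketch (not taken from the textblob-de sources; `avg` here is a stand-in for pattern's averaging helper):

```python
def avg(values):
    values = list(values)
    return sum(values) / len(values)

# (polarity, subjectivity, intensity) triples for two word senses.
psi = [(1.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

scores = map(avg, zip(*psi))  # Python 3: a one-shot <map object>, not a list

p, s, i = scores              # 1st unpack consumes the iterator: 1.0 0.0 1.0
p, s, i = scores              # 2nd unpack: ValueError, nothing left to unpack
```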
One possible solution would be to convert the `map` objects in the vendorised `pattern` implementation into persistent `list`s, which can be unpacked any number of times (this works on both Python 2 and Python 3):
# ext/_pattern/text/__init__.py [lines 1911-1922]
# Average scores of all word senses per part-of-speech tag.
for w in words:
words[w] = dict((pos, list(map(avg, zip(*psi)))) for pos, psi in words[w].items())
# Average scores of all part-of-speech tags.
for w, pos in words.items():
words[w][None] = list(map(avg, zip(*pos.values())))
# Average scores of all synonyms per synset.
for id, psi in synsets.items():
synsets[id] = list(map(avg, zip(*psi)))
dict.update(self, words)
dict.update(self.labeler, labels)
dict.update(self._synsets, synsets)
The second option would be to use the solution applied in the main `textblob` library, which replaces `map` with list comprehensions:
# textblob/_text.py [lines 765-776]
# Average scores of all word senses per part-of-speech tag.
for w in words:
words[w] = dict((pos, [avg(each) for each in zip(*psi)]) for pos, psi in words[w].items())
# Average scores of all part-of-speech tags.
for w, pos in list(words.items()):
words[w][None] = [avg(each) for each in zip(*pos.values())]
# Average scores of all synonyms per synset.
for id, psi in synsets.items():
synsets[id] = [avg(each) for each in zip(*psi)]
dict.update(self, words)
dict.update(self.labeler, labels)
dict.update(self._synsets, synsets)
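With either option, the lookup that previously failed now succeeds on repeated calls, since a `list` (unlike a `map` object) survives being unpacked more than once. A quick hypothetical check, mirroring the dictionary contents shown above:

```python
lexicon = {"schön": {"JJ": [1.0, 0.0, 1.0], None: [1.0, 0.0, 1.0]}}

p, s, i = lexicon["schön"][None]  # 1st lookup: 1.0 0.0 1.0
p, s, i = lexicon["schön"][None]  # 2nd lookup: still 1.0 0.0 1.0
```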
My preference would be to stay as close to the original `pattern` implementation as possible (i.e. Version 1).

@Arttii Any thoughts on that?
Went for the 2nd option, as it is easier to read and consistent with the main `textblob` library.
Hi,
Sorry, I was a bit busy with work. Yeah, the second one seems more sensible. It's a bit confusing for people coming from the main library and looking at the sources (it was for me, at least). Also, it seems more "pythonic", to be honest.
Cheers for the quick fix.
Edit: I tested this with my setup; it works perfectly.
With this small change we can share the analyzer across blob instances, so we do not load the file every time. This seems to work fine. Any reason why we are not sharing the `sentiment` instance globally? (textblob-en does so.)
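For illustration, the sharing idea amounts to a module-level cache. A minimal, self-contained sketch (hypothetical names and XML schema; not the actual textblob-de code): the expensive lexicon is parsed once per process and every blob reuses it.

```python
import xml.etree.ElementTree as ET

_lexicon = None  # process-wide cache, populated on first use

def get_lexicon(path="de-sentiment.xml"):
    """Return the shared sentiment lexicon, parsing the XML only once."""
    global _lexicon
    if _lexicon is None:
        root = ET.parse(path).getroot()
        # Hypothetical schema: <word form="..." polarity="..."/> entries.
        _lexicon = {w.get("form"): float(w.get("polarity", 0.0))
                    for w in root.iter("word")}
    return _lexicon
```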