kevincobain2000 / jProcessing

Japanese Natural Language Processing Libraries
http://readthedocs.org/docs/jprocessing/en/latest/
BSD 2-Clause "Simplified" License
148 stars, 30 forks

Cannot run sentiment analysis example (classifier) #7

Closed renoust closed 6 years ago

renoust commented 8 years ago

Hi everyone,

I am trying to run your sentiment analysis demo and I am facing a cElementTree.ParseError. I am running on OSX 10.11 with Python 2.7. I downloaded the wordnet files (as of today: SentiWordNet_3.0.0_20130122.txt, with the current wn: 2010-10-22). I ran your example as you presented:

>>> from jNlp.jSentiments import *
>>> jp_wn = 'path_to/wnjpn-all.tab'
>>> en_swn = 'path_to/SentiWordNet_3.0.0_20130122.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)

and obtain the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 55, in baseline
  File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 48, in polarScores_text
  File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
  File "<string>", line 124, in XML
cElementTree.ParseError: not well-formed (invalid token): line 1, column 2

However, the polarity score example works fine, and I obtain the right scores! If you have any idea, I'd be grateful for your help!

Best,

kevincobain2000 commented 8 years ago

Hi! Let me check that sometime this week. Thanks.

renoust commented 8 years ago

Sorry, this is a duplicate of a previous issue, for which you suggested:

Seems like an encoding error. Try passing in a UTF-8 encoded string, or possibly you need to change your terminal settings if you tried the code from the Python CLI: http://stackoverflow.com/questions/13046240/parseerror-not-well-formed-invalid-token-using-celementtree

Here is more input: the problem also occurs with the tokenize example:

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

I tried your solution of UTF-8 encoding, but it didn't solve the issue.

input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'.encode('utf-8')
list_of_tokens = jTokenize(input_sentence)

I tested other encodings without success: shift-jis, euc-jp, cp932, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004... These give the same error as Unicode:

 File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
 File "<string>", line 124, in XML
cElementTree.ParseError: not well-formed (invalid token): line 1, column 2

For the UTF-X encodings I get different errors. For UTF-8:

  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
  File "<string>", line 125, in XML
cElementTree.ParseError: no element found: line 1, column 0

and a slightly different error if I try utf-16, utf-32, cp932, shift-jis...

  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
  File "<string>", line 124, in XML
cElementTree.ParseError: syntax error: line 1, column 19

renoust commented 8 years ago

Alright, it seems that the usage of CaboCha has changed or is different on my setup. I didn't fork the project, so I'm just giving you the edits (working on OSX 10.11).

I'm running CaboCha 0.69. As I understand it, you need to parse the XML output of CaboCha, right?

A few things were going wrong in jCabocha.py, and the problem seems to happen in the cabocha() function:

My version of CaboCha handles UTF-8 well, so the forced re-encoding in try: sent = sent.encode('utf-8') somehow causes problems. The same goes for the return value, return unicode(output, 'utf-8').

As for the subprocess call, with your syntax it did not seem able to produce the XML output (at best I got the "tree" output corresponding to -f0). Also, you don't need to feed standard input; just passing the file path works: command = ['/usr/local/bin/cabocha', '-f3', temp.name]
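
A minimal sketch of how the non-XML output described above could be caught before it reaches the XML parser, so that it fails with a readable message rather than an opaque ParseError. It is written for Python 3 (the thread itself uses Python 2.7), parse_cabocha_xml is a hypothetical helper that is not part of jProcessing, and the sample string is a simplified stand-in for a single sentence, not real CaboCha output.

```python
import xml.etree.ElementTree as ET


def parse_cabocha_xml(raw):
    """Parse bytes that are expected to hold one XML document (e.g. from -f3).

    Hypothetical helper: `raw` is whatever was read from the subprocess's stdout.
    """
    text = raw.decode("utf-8").strip()
    if not text.startswith("<"):
        # Tree-style output or an empty string ends up here instead of
        # surfacing later as a ParseError from the XML parser.
        raise ValueError("expected XML but got: %r" % text[:40])
    return ET.fromstring(text)


# Simplified, made-up sample; real CaboCha output carries more attributes.
sample_xml = '<sentence><chunk id="0"><tok id="0">監督</tok></chunk></sentence>'
root = parse_cabocha_xml(sample_xml.encode("utf-8"))
tokens = [tok.text for tok in root.iter("tok")]
print(tokens)

# Non-XML input is rejected up front with a message that shows what arrived.
try:
    parse_cabocha_xml("監督-D\nEOS\n".encode("utf-8"))
except ValueError as exc:
    print("rejected:", exc)
```

Checking the first character is only a cheap heuristic; it makes the "wrong output format" case visible but does not validate the XML itself.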

Now I get the following output from your demo (not sure it's right):

text = u'監督、俳優、ストーリー、演出、全部最高!'
print classifier.baseline(text)
Pos Score = 1.000 Neg Score = 0.000
Text is Positive

renoust commented 8 years ago

Maybe you will be able to confirm that the output with the current SentiWordNet is right. For the sentence:

u'監督、俳優、ストーリー、演出、全部最高!'

I get Pos: 1 Neg: 0, whereas your example gives Pos: 0.625 Neg: 0.125. Is this only due to the updated data?

I tested the value for 'sad': print sentiwordnet[jpwordnet[u'寂しい']][1]

and obtained: 0.625

yobo000 commented 8 years ago

I found that this code in the cabocha() function

    command = ['cabocha', '-f','3 <', temp.name]
    process = subprocess.Popen(command, stdout=subprocess.PIPE)

doesn't return an XML string. Just run jCabocha.py and you will see it. This is the same thing @renoust described.

Maybe you can use this instead:

    command = ['cabocha', '-f', '3']
    process = subprocess.Popen(command, stdin=open(temp.name, 'r'), stdout=subprocess.PIPE)

With this you get the XML string.
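
The difference between the two forms comes from the fact that subprocess.Popen, when given a list and no shell, passes every element to the program as a literal argument, so '3 <' is never interpreted as input redirection. The sketch below illustrates this in Python 3 with a small Python child process standing in for cabocha (an assumption made so the example runs without CaboCha installed); CHILD and run_child are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for cabocha: prints the arguments it received, then its stdin.
CHILD = "import sys; print(sys.argv[1:]); print(repr(sys.stdin.read()))"


def run_child(args, stdin):
    """Run the stand-in child without a shell and return its stdout as text."""
    cmd = [sys.executable, "-c", CHILD] + args
    result = subprocess.run(cmd, stdin=stdin, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")


with tempfile.NamedTemporaryFile("w", delete=False) as temp:
    temp.write("STDIN-PAYLOAD")

try:
    # Original form: '3 <' and the path arrive as plain arguments; stdin is empty.
    literal_out = run_child(["-f", "3 <", temp.name], subprocess.DEVNULL)

    # Suggested form: the file is attached to the child's standard input.
    with open(temp.name, "r") as handle:
        redirected_out = run_child(["-f", "3"], handle)
finally:
    os.remove(temp.name)

print(literal_out)
print(redirected_out)
```

In the first run the child sees '3 <' among its arguments and reads nothing from standard input; in the second it sees only '-f' and '3' and receives the file contents on standard input.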

kevincobain2000 commented 8 years ago

Merged