renoust closed this issue 6 years ago
Hi! Let me check that sometime this week. Thanks.
Sorry, this is a duplicate of a previous issue, for which you suggested:
Seems like an encoding error. Try passing in a UTF-8 encoded string, or you may need to adjust your terminal settings if you tried the code from the Python CLI: http://stackoverflow.com/questions/13046240/parseerror-not-well-formed-invalid-token-using-celementtree
Here is more input: this problem also occurs with the tokenize example.
>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')
I tried your solution of UTF-8 encoding, but it didn't solve the issue.
input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'.encode('utf-8')
list_of_tokens = jTokenize(input_sentence)
I tested other encodings without success: shift-jis, euc-jp, cp932, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004... These give the same error as Unicode:
File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
File "<string>", line 124, in XML
cElementTree.ParseError: not well-formed (invalid token): line 1, column 2
For the UTF-X encodings I get different errors. For UTF-8:
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
File "<string>", line 125, in XML
cElementTree.ParseError: no element found: line 1, column 0
and a slightly different error if I try utf-16, utf-32, cp932, shift-jis...
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
File "<string>", line 124, in XML
cElementTree.ParseError: syntax error: line 1, column 19
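One possible reading of the three tracebacks above is that the XML parser is being handed something that is not XML at all (for example an empty string or plain text), rather than a badly encoded document. The sketch below is not jNlp code; it uses only the standard library, is written for Python 3 rather than the Python 2.7 used in this thread, and the sample payloads are made up for illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(payload):
    """Return ('ok', root_tag) if payload parses as XML, else ('error', message)."""
    try:
        return ("ok", ET.fromstring(payload).tag)
    except ET.ParseError as exc:
        return ("error", str(exc))


# Hypothetical payloads standing in for what the parser might receive.
samples = {
    "empty output": "",
    "plain text, not XML": "this is not XML",
    "well-formed XML": "<sentence><chunk id='0'/></sentence>",
}

for label, payload in samples.items():
    print(label, "->", try_parse(payload))
```

If the empty case reproduces the "no element found" message, that would suggest checking what the external tool actually printed before looking at encodings.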
Alright, it seems that the usage of cabocha has changed or is different. I didn't fork the project, so I'm just giving you the edits (working on OSX 10.11).
I'm running CaboCha 0.69. I figured out that you need to parse the XML output of cabocha, right?
A few things were causing trouble in jCabocha.py, and the problem seems to happen in the cabocha() function:
My version of CaboCha handles UTF-8 well, so the forced re-encoding
try: sent = sent.encode('utf-8')
seems to be causing problems.
The same goes for the return value:
return unicode(output, 'utf-8')
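A possible explanation for the re-encoding trouble: in Python 2.7, calling .encode('utf-8') on a byte string that already contains non-ASCII data first triggers an implicit ASCII decode, which fails. A minimal sketch of converting exactly once in each direction is shown below. It is written for Python 3, and the helper names to_utf8_bytes and to_text are hypothetical, not part of jNlp.

```python
def to_utf8_bytes(text):
    """Return UTF-8 bytes, encoding only when given text (str)."""
    if isinstance(text, bytes):
        # Assumption: bytes passed in are already UTF-8, so leave them alone.
        return text
    return text.encode("utf-8")


def to_text(data):
    """Return text (str), decoding only when given bytes."""
    if isinstance(data, bytes):
        return data.decode("utf-8")
    return data


sample = "駅で見かけた"
# Applying either helper twice is harmless, unlike a forced second encode.
print(to_utf8_bytes(to_utf8_bytes(sample)) == sample.encode("utf-8"))
print(to_text(to_text(sample.encode("utf-8"))) == sample)
```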
As for the subprocess call, with your syntax it did not seem able to produce the XML output (at best I got the "tree" output corresponding to -f0). Also, you don't need to force-feed the standard input; just passing the file path works:
command = ['/usr/local/bin/cabocha', '-f3', temp.name]
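Put together, the file-path approach could look roughly like the sketch below. It assumes a cabocha binary is on the PATH and accepts -f3 followed by an input file, as described in this comment. It is written for Python 3, and the function names build_cabocha_command and run_cabocha are hypothetical rather than taken from jCabocha.py.

```python
import os
import subprocess
import tempfile


def build_cabocha_command(input_path, binary="cabocha"):
    """Build the argument list: binary, XML format flag, input file path."""
    return [binary, "-f3", input_path]


def run_cabocha(sentence, binary="cabocha"):
    """Write sentence to a UTF-8 temp file, run cabocha on it, return stdout as text."""
    temp = tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", suffix=".txt", delete=False
    )
    try:
        with temp:
            temp.write(sentence + "\n")
        # The file is closed (and flushed) before the external tool reads it.
        output = subprocess.check_output(build_cabocha_command(temp.name, binary))
        return output.decode("utf-8")
    finally:
        os.remove(temp.name)


# Usage (requires cabocha to be installed):
#   xml_text = run_cabocha("私は彼を5日前、つまりこの前の金曜日に駅で見かけた")
print(build_cabocha_command("input.txt"))
```

Closing the temporary file before launching the subprocess avoids handing the tool a file whose contents are still buffered.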
Now I get the following output from your demo (not sure it's right):
text = u'監督、俳優、ストーリー、演出、全部最高!'
print classifier.baseline(text)
Pos Score = 1.000 Neg Score = 0.000
Text is Positive
Maybe you will be able to confirm that the output with the current SentiWordNet is right. For the sentence:
u'監督、俳優、ストーリー、演出、全部最高!'
I get Pos: 1 Neg: 0
against your example which gives Pos: 0.625 Neg: 0.125
Is it only due to updated data?
I tested the value of 'sad':
print sentiwordnet[jpwordnet[u'寂しい']][1]
and obtained:
0.625
I found that the following code in jCabocha.py
command = ['cabocha', '-f','3 <', temp.name]
process = subprocess.Popen(command, stdout=subprocess.PIPE)
doesn't return an XML string; just run jCabocha.py and you will see it. This is the same thing @renoust reported.
Maybe you can use this instead:
command = ['cabocha', '-f', '3']
process = subprocess.Popen(command,stdin=open(temp.name,'r'), stdout=subprocess.PIPE)
With this you get the XML string.
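The likely reason the first form fails is that subprocess.Popen, when given a list and no shell=True, does not go through a shell, so '3 <' reaches the program as a literal argument instead of being treated as a redirection. The sketch below illustrates this with a stand-in child process (the Python interpreter itself, not cabocha) that echoes its arguments and its standard input. It is written for Python 3, and the helper names are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in child: prints the arguments it received, then whatever is on stdin.
CHILD = "import sys; print(sys.argv[1:]); print(sys.stdin.read().strip())"


def run_child(extra_args, stdin=subprocess.DEVNULL):
    """Run the stand-in child with extra_args and return its output lines."""
    command = [sys.executable, "-c", CHILD] + list(extra_args)
    output = subprocess.check_output(command, stdin=stdin)
    return output.decode("utf-8").splitlines()


def demo():
    """Compare a '<' inside the argument list with a real stdin file handle."""
    temp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    try:
        with temp:
            temp.write("hello\n")
        # '<' is just text here: the child sees it as part of an argument.
        literal = run_child(["-f", "3 <", temp.name])
        # Redirection done by Python: the file handle becomes the child's stdin.
        with open(temp.name, "rb") as handle:
            redirected = run_child(["-f", "3"], stdin=handle)
        return literal, redirected
    finally:
        os.remove(temp.name)


literal, redirected = demo()
print(literal)
print(redirected)
```

In the first call the child receives three arguments and an empty stdin; in the second it receives two arguments and the file contents on stdin, which matches the behaviour of the fixed Popen call.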
Merged
Hi everyone,
I am trying to run your sentiment analysis demo and I am facing a cElementTree.ParseError. I am running on OSX 10.11 with Python 2.7. I downloaded the wordnet files (as of today: SentiWordNet_3.0.0_20130122.txt, with the current wn: 2010-10-22). I ran your example as you presented:
and obtained the following error:
However, the polarity score example works fine, and I obtain the right scores! If you have any idea, I'd be grateful for your help!
Best,