kevincobain2000 / jProcessing

Japanese Natural Language Processing Libraries
http://readthedocs.org/docs/jprocessing/en/latest/
BSD 2-Clause "Simplified" License
148 stars, 30 forks

jReads does not exist #1

Closed npx closed 9 years ago

npx commented 10 years ago

Hey there,

I compiled and installed all dependencies and now wanna run some of the examples presented here.

>>> from jNlp.jConvert import *
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 4, in <module>
    from jNlp.jTokenize import jTokenize, jReads
ImportError: cannot import name jReads

I tried replacing the jReads with the jTokenize method but I didn't expect that to work :)

I found an old implementation, which I took and changed to use cabocha().

def jReads(target_sent):
    sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
    jReadsToks = []
    for chunk in sentence:
        for tok in chunk.findall('tok'):
            if tok.get("read"): jReadsToks.append(tok.get("read"))
    return jReadsToks

However, I don't seem to be getting a valid XML:

気象庁が21日午前4時48分、発表した天気概況によると、
tokenizedRomaji(input_sentence)

iconv_open is not supported
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 42, in tokenizedRomaji
    for kataChunk in jReads(jSent):
  File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jTokenize.py", line 29, in jReads
    sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
  File "<string>", line 124, in XML
ParseError: not well-formed (invalid token): line 1, column 4

I compiled and installed iconv but is this related to the problem?

Also, I verified my installation of mecab and cabocha and both seem to work fine.

But jReads really does not exist xP

kevincobain2000 commented 10 years ago

Thanks man, in fact the method doesn't exist. Will fix soon.

timmahrt commented 10 years ago

Any update on this? I'm working on my own code to convert kanji to katakana/hiragana, based on the unidic dictionary from ninjal.

kevincobain2000 commented 10 years ago

Tentative fix below. I'll try to do the fix over this weekend by adding the previous code.

def jReads(target_sent):
    sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
    jReadsToks = []
    for chunk in sentence:
        for tok in chunk.findall('tok'):
            if tok.get("read"): jReadsToks.append(tok.get("read"))
    return jReadsToks

I never had that error (iconv_open is not supported) in the past, but let me confirm.

timmahrt commented 10 years ago

When the function jNlp.jCabocha.cabocha runs on my machine, it doesn't return XML. However, it works fine if I change the line command = ['cabocha', '-f', '3 <', temp.name] to command = ['cabocha', '-f', '3', temp.name]
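A likely reason the change helps: when a command is passed to subprocess as an argument list, no shell is involved, so '<' is not treated as input redirection and '3 <' reaches the program as one literal argument. The sketch below illustrates this without needing cabocha installed. It uses a child Python process that just prints its arguments; the helper name echo_argv and the file name input.txt are made up for the illustration and are not part of jProcessing.

```python
import subprocess
import sys


def echo_argv(args):
    """Run a child Python process that prints the arguments it received.

    Hypothetical helper for illustration only; it stands in for the
    'cabocha' executable so the example runs anywhere.
    """
    child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
    return subprocess.check_output(child + args).decode("utf-8").strip()


# Argument list, no shell: '<' is NOT a redirection here. The child
# receives the literal string '3 <' as the value of -f.
broken = echo_argv(["-f", "3 <", "input.txt"])

# Format number and file name passed as separate, clean arguments.
fixed = echo_argv(["-f", "3", "input.txt"])

print(broken)
print(fixed)
```

If redirection is really wanted, the usual alternatives are to open the file in Python and pass it as stdin, or to pass the file name as its own argument as in the corrected line.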

Also, are you looking for the 'read' attribute in the XML? The XML I get back is pretty sparse (still usable for my purposes, though):

荒く

I'm not sure if maybe I configured cabocha or one of its dependencies incorrectly?
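For anyone hitting the same thing: with ElementTree, tok.get("read") returns None when the attribute is absent, so the jReads snippet posted in this thread would quietly return an empty list rather than raise an error. A minimal sketch of that behaviour follows. The two XML strings are hand-written stand-ins, not real cabocha output, and the reading アラク is only an illustrative value; readings_from_xml is a hypothetical name that mirrors the posted jReads logic but takes the XML text directly.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as etree


def readings_from_xml(xml_text):
    """Collect the 'read' attribute of every <tok> element.

    Same logic as the jReads snippet in this thread, except that the
    XML is passed in directly instead of coming from cabocha().
    """
    sentence = etree.fromstring(xml_text.encode("utf-8"))
    readings = []
    for chunk in sentence:
        for tok in chunk.findall("tok"):
            # tok.get() returns None when the attribute is missing,
            # so tokens without 'read' are skipped silently.
            if tok.get("read"):
                readings.append(tok.get("read"))
    return readings


# Hand-written stand-ins (assumptions, NOT actual cabocha output).
sparse = u'<sentence><chunk><tok>荒く</tok></chunk></sentence>'
with_read = u'<sentence><chunk><tok read="アラク">荒く</tok></chunk></sentence>'

print(readings_from_xml(sparse))
print(readings_from_xml(with_read))
```

So an empty result from jReads does not necessarily mean parsing failed; it can simply mean the tokens carry no 'read' attribute in that output format.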

anyong commented 9 years ago

Did you ever get the 'read' label issue fixed?

kevincobain2000 commented 9 years ago

@anyong https://github.com/kevincobain2000/jProcessing/commit/d59425fb8b63365340458ec683037f4d98e7e255 fixes it. Just didn't close the issue.