takuyaa / kuromoji.js

JavaScript implementation of Japanese morphological analyzer

研究 is broken down into 研 and 究 #16

Closed (katspaugh closed this issue 6 years ago)

katspaugh commented 7 years ago

I'm using kuromoji.js for a web app: https://kuromoji.fluentcards.com/

If you enter 研究, it gets broken down into 研 and 究. The same happens on the demo site.

However, when I try the same word on the original Java version's website, 研究 gets parsed as a single token.

Is this behavior configurable?

Edit: I've looked through the source and realized that it uses the Viterbi algorithm, which is not the default on the original Kuromoji demo site, hence the difference in output. Closing the issue.
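
To illustrate why a Viterbi-style search can return either 研究 or 研 + 究, here is a minimal sketch of minimum-cost segmentation over a word lattice. It is not kuromoji.js's implementation: the function name `segment`, the toy dictionaries, and all costs are invented for illustration, and real analyzers also add connection costs between adjacent tokens.

```javascript
// Toy minimum-cost segmentation (Viterbi-style dynamic programming).
// ASSUMPTION: the dictionaries and costs below are made up for illustration;
// they are not kuromoji's data, and connection costs are ignored.
function segment(text, dict) {
  const chars = Array.from(text);
  const n = chars.length;

  // best[i] holds the cheapest known segmentation of the first i characters.
  const best = new Array(n + 1).fill(null);
  best[0] = { cost: 0, prev: -1, word: null };

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === null) continue; // prefix not reachable
      const word = chars.slice(start, end).join("");
      if (!Object.prototype.hasOwnProperty.call(dict, word)) continue;
      const cost = best[start].cost + dict[word];
      if (best[end] === null || cost < best[end].cost) {
        best[end] = { cost, prev: start, word };
      }
    }
  }

  if (best[n] === null) return null; // no dictionary path covers the text

  // Walk the back-pointers from the end to recover the chosen words.
  const words = [];
  for (let i = n; i > 0; i = best[i].prev) {
    words.unshift(best[i].word);
  }
  return words;
}

// The compound is cheaper than its two parts combined: kept as one token.
const compoundFavoured = { "研": 5, "究": 5, "研究": 4 };
console.log(segment("研究", compoundFavoured)); // [ '研究' ]

// The single characters are cheaper in total: the word is split.
const partsFavoured = { "研": 1, "究": 1, "研究": 4 };
console.log(segment("研究", partsFavoured)); // [ '研', '究' ]
```

With this kind of search the result depends entirely on the costs, so a wrong cost lookup would be enough to split a word that the dictionary contains as a single entry.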

iapyeh commented 7 years ago

Excuse me, then how can I change the behavior so that "研究" is parsed as a single token? (By disabling the Viterbi algorithm?)

jrorsini commented 6 years ago

Maybe this will work: let path = tokenizer.tokenizeForSentence(txtToGloss);

katspaugh commented 6 years ago

@jrorsini nope, same result. I've just tried it in my project https://kuromoji.fluentcards.com/ by changing tokenize to tokenizeForSentence, but it's still breaking 研究 mid-word.

jrorsini commented 6 years ago

@katspaugh, maybe you should add this file, which is the one that contains the prototype method: https://github.com/takuyaa/kuromoji.js/blob/master/build/kuromoji.js

katspaugh commented 6 years ago

@jrorsini I mean, tokenizeForSentence is defined and it runs. It just behaves the same way as tokenize with regard to 研究.
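
For reference, a minimal sketch of how the two methods can be compared side by side. The helper name `compareSegmentations` is made up for this example; it only assumes an object with kuromoji-style `tokenize(text)` and `tokenizeForSentence(text)` methods whose tokens carry a `surface_form` field. The dictionary path in the usage comment is an assumption about a typical npm install.

```javascript
// Collect the surface forms that both entry points produce for the same input.
// ASSUMPTION: `tokenizer` exposes kuromoji-style tokenize(text) and
// tokenizeForSentence(text), each returning tokens with a surface_form field.
function compareSegmentations(tokenizer, text) {
  const surfaces = (tokens) => tokens.map((t) => t.surface_form);
  return {
    tokenize: surfaces(tokenizer.tokenize(text)),
    tokenizeForSentence: surfaces(tokenizer.tokenizeForSentence(text)),
  };
}

// Usage with the real library (the dicPath value is a hypothetical location;
// point it at wherever the dictionary files actually live):
//
//   const kuromoji = require("kuromoji");
//   kuromoji.builder({ dicPath: "node_modules/kuromoji/dict" }).build((err, tokenizer) => {
//     if (err) throw err;
//     console.log(compareSegmentations(tokenizer, "研究"));
//   });
```

If both arrays contain 研 and 究 as separate entries, the split is not specific to either method.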

takuyaa commented 6 years ago

This seems to be a bug.

#25 will fix this issue.

katspaugh commented 6 years ago

Cool! I'll test it with my app and let you know if the fix helped with this issue.

takuyaa commented 6 years ago

@katspaugh I've released the fixed version, 0.1.2! You can check it on the demo site: https://takuyaa.github.io/kuromoji.js/demo/tokenize.html Please try it. FYI @iapyeh @jrorsini

katspaugh commented 6 years ago

I confirm the issue is fixed in my app, too. Many thanks! 👍