Closed. katspaugh closed this issue 6 years ago.
Excuse me, then how can I change the behavior so that "研究" is parsed as a single token? (Disable the Viterbi algorithm?)
Maybe try this, it should work: `let path = tokenizer.tokenizeForSentence(txtToGloss);`
@jrorsini nope, the same.
I've just tried it in my project https://kuromoji.fluentcards.com/: I changed `tokenize` to `tokenizeForSentence`, but it's still breaking 研究 mid-word.
@katspaugh, maybe you should add this file: https://github.com/takuyaa/kuromoji.js/blob/master/build/kuromoji.js That's the one that contains the prototype method.
@jrorsini I mean, `tokenizeForSentence` worked, it's defined. It just behaves the same way as `tokenize` with regard to 研究.
This seems to be a bug.
Cool! I'll test it with my app and let you know if the fix helped with this issue.
@katspaugh I released a fixed version, 0.1.2! You can check it on the demo site: https://takuyaa.github.io/kuromoji.js/demo/tokenize.html Please try it. FYI @iapyeh @jrorsini
I confirm the issue is fixed in my app, too. Many thanks! 👍
I'm using kuromoji.js for a web app: https://kuromoji.fluentcards.com/
If you enter 研究, it gets broken down to 研 and 究. The same on the demo site.
However, when I try the same word on the original Java version's website, 研究 gets parsed as a single token. Is this behavior configurable?
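For anyone who wants to reproduce this outside the web app, here is a minimal sketch of a check. It assumes a tokenizer object with a `tokenize(text)` method whose tokens carry a `surface_form` field, which is how kuromoji.js tokens look. The helper names `surfaceForms` and `isSplit` are made up for this example, and the `dicPath` in the commented usage is an assumption about the install layout.

```javascript
// Returns the surface forms produced by a kuromoji.js-style tokenizer.
// Assumes tokenizer.tokenize(text) returns objects with a `surface_form` field.
function surfaceForms(tokenizer, text) {
  return tokenizer.tokenize(text).map((token) => token.surface_form);
}

// True when `word` comes back as more than one token.
// (Hypothetical helper, not part of kuromoji.js.)
function isSplit(tokenizer, word) {
  return surfaceForms(tokenizer, word).length > 1;
}

// Usage with the real library. This needs the kuromoji package and its
// dictionary files; the dicPath below is an assumption, adjust it as needed.
//
// const kuromoji = require("kuromoji");
// kuromoji.builder({ dicPath: "node_modules/kuromoji/dict" }).build((err, tokenizer) => {
//   if (err) throw err;
//   console.log(surfaceForms(tokenizer, "研究"), isSplit(tokenizer, "研究"));
// });
```

If `isSplit` reports `true` for 研究, the tokenizer is showing the behavior described above.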
Edit: I've looked through the source and realized it's using the Viterbi algorithm, which is not the default on the original Kuromoji demo site. Hence the difference in the output. Closing the issue.
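For context, here is a toy sketch of how a Viterbi-style lowest-cost search decides between 研究 and 研 + 究. It is not kuromoji.js code: the function `viterbiSegment`, the word costs, and the single flat connection cost are all invented for illustration, and a real dictionary uses per-entry word costs plus a connection-cost matrix.

```javascript
// Toy lattice segmenter: returns the segmentation with the lowest total cost.
// All costs here are made up; they are not taken from kuromoji.js or its dictionary.
function viterbiSegment(text, wordCosts, connectionCost = 0) {
  const chars = Array.from(text);
  const n = chars.length;
  // best[i] holds the cheapest known way to segment the first i characters.
  const best = new Array(n + 1).fill(null);
  best[0] = { cost: 0, prev: -1, word: null };

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === null) continue; // prefix is not reachable
      const word = chars.slice(start, end).join("");
      if (!wordCosts.has(word)) continue; // not a dictionary word
      // Every token after the first also pays a connection cost.
      const cost =
        best[start].cost + wordCosts.get(word) + (start > 0 ? connectionCost : 0);
      if (best[end] === null || cost < best[end].cost) {
        best[end] = { cost, prev: start, word };
      }
    }
  }
  if (best[n] === null) return null; // no path through the lattice

  // Walk the back-pointers from the end to recover the tokens.
  const tokens = [];
  for (let i = n; i > 0; i = best[i].prev) tokens.unshift(best[i].word);
  return tokens;
}

// Compound is cheaper than its parts: kept whole.
const compoundCheap = new Map([["研", 500], ["究", 500], ["研究", 300]]);
// Compound is more expensive than its parts: split in two.
const compoundCostly = new Map([["研", 100], ["究", 100], ["研究", 900]]);

console.log(viterbiSegment("研究", compoundCheap));  // one token: 研究
console.log(viterbiSegment("研究", compoundCostly)); // two tokens: 研, 究
```

The point of the sketch is that the same search can produce either result. Whether 研究 stays whole depends on the costs fed into it, not on whether a Viterbi search is used at all.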