takuyaa / kuromoji.js

JavaScript implementation of Japanese morphological analyzer

Now word_position returns the real position of the word in the text #10

Closed hoshiumiarata closed 8 years ago

hoshiumiarata commented 8 years ago

For text such as 'あ、あ', the tokenizer returns an incorrect result (at least for practical use). The word_position values for the tokens ('あ', '、', 'あ') are (1, 2, 1). It would be more logical to return (1, 2, 3).
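
To illustrate the difference, here is a minimal sketch of the two numbering schemes. It is not kuromoji.js code: wordPositions is a hypothetical helper, and the grouping of tokens into two "sentences" split after '、' is an assumption made only to reproduce the (1, 2, 1) result described above.

```javascript
// Hypothetical helper, not part of kuromoji.js.
// `sentences` is an array of sentences, each an array of token surface forms.
// Returns 1-based positions, either restarting at each sentence or
// counted from the start of the whole text.
function wordPositions(sentences, restartPerSentence) {
  const positions = [];
  let offset = 0; // number of characters in the sentences already processed
  for (const surfaces of sentences) {
    let local = 1; // 1-based position inside the current sentence
    for (const surface of surfaces) {
      positions.push(restartPerSentence ? local : offset + local);
      local += surface.length;
    }
    offset += local - 1;
  }
  return positions;
}

// Assumed split of 'あ、あ' into two sentences after the comma.
const sentences = [["あ", "、"], ["あ"]];

const relative = wordPositions(sentences, true);  // [1, 2, 1]
const absolute = wordPositions(sentences, false); // [1, 2, 3]

console.log(relative, absolute);
```

The second form is the one that lets a caller map each token back to its place in the original string.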

azu commented 8 years ago

I have the same problem.

Workaround: use tokenizer.tokenizeForSentence(text); instead of tokenizer.tokenize(text);

takuyaa commented 8 years ago

@ukrainskiysergey Thank you for your contribution. I tried and tested these changes. This PR makes a breaking change to word_position, but I think the new behavior is ideal. I would be happy if you added some tests the next time you send a PR. Thanks.