Closed hoshiumiarata closed 8 years ago
I have the same problem.
Workaround: use tokenizer.tokenizeForSentence(text);
instead of tokenizer.tokenize(text);
@ukrainskiysergey Thank you for your contribution. I tried and tested these changes. This PR has a breaking change regarding word_position, but I think this behavior is ideal. I would be happy if you added some tests the next time you send a PR. Thanks.
For text like 'あ、あ', the tokenizer gives an incorrect result (at least in terms of practical use): word_position for the tokens ('あ', '、', 'あ') comes back as (1, 2, 1). It would be logical to return (1, 2, 3).
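To illustrate the expected behavior, here is a minimal sketch of a post-processing helper (hypothetical, not part of kuromoji.js) that recomputes 1-based character offsets from each token's surface_form, so the 'あ、あ' case yields (1, 2, 3) instead of (1, 2, 1):

```javascript
// Hypothetical helper: rebuild word_position as a running 1-based
// character offset, instead of the per-sentence offset that restarts
// after punctuation.
function fixWordPositions(tokens) {
  let pos = 1;
  return tokens.map(function (token) {
    const fixed = Object.assign({}, token, { word_position: pos });
    pos += token.surface_form.length; // advance by token length
    return fixed;
  });
}

// Tokens as reported for 'あ、あ' (word_position restarts at 1):
const tokens = [
  { surface_form: 'あ', word_position: 1 },
  { surface_form: '、', word_position: 2 },
  { surface_form: 'あ', word_position: 1 },
];
console.log(fixWordPositions(tokens).map(t => t.word_position)); // [1, 2, 3]
```

This only patches the positions after tokenization; the PR referenced above fixes the offsets inside the tokenizer itself.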