We insert some spurious token boundaries when Japanese text is run through `simple_tokenize`, because a few characters don't match any of our "spaceless scripts".
Japanese text is only run through `simple_tokenize` in unusual situations, where we would rather not tokenize Japanese at all unless the token boundaries are really obvious, which is the case in ConceptNet. This change should not, for example, affect a language pipeline that tokenizes Japanese text as Japanese, because that pipeline would use MeCab, not `simple_tokenize`.
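
To illustrate the failure mode, here is a minimal sketch. This is not the actual `simple_tokenize` pattern; the character class is a hypothetical simplification, and ー is one plausible example of such a character. A tokenizer that defines a token as a run of characters from a single "spaceless script" will break a Japanese word at a character like ー (U+30FC, the prolonged sound mark), because its Unicode Script property is Common rather than Hiragana or Katakana:

```python
# A minimal sketch, NOT the actual simple_tokenize implementation:
# it shows how defining tokens as runs of same-script characters
# can insert spurious boundaries in Japanese text.
import regex  # third-party `regex` module, which supports \p{Script} classes

# Hypothetical simplified pattern: a token is a run of Hiragana or
# Katakana characters, or otherwise a run of ordinary word characters.
TOKEN_RE = regex.compile(r"[\p{Hiragana}\p{Katakana}]+|\w+", regex.V1)

# ー has Script=Common, so it matches neither script class, and the
# single Katakana word is split at a spurious boundary:
print(TOKEN_RE.findall("スープ"))  # ['ス', 'ープ'] instead of ['スープ']
```

One way to avoid the spurious split in a pattern like this is to include such Common-script marks in the character class explicitly, so they extend the surrounding run instead of terminating it.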