rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Handle Japanese edge cases in `simple_tokenize` #56

Closed: rspeer closed this issue 6 years ago

rspeer commented 6 years ago

We insert some spurious token boundaries when Japanese text is being run through simple_tokenize, because of a few characters that don't match any of our "spaceless scripts".

Japanese text is only run through simple_tokenize in unusual situations, where we mostly don't want to tokenize Japanese unless the token boundaries are really obvious, as is the case in ConceptNet. This change should not, for example, affect a language pipeline that is tokenizing Japanese text as Japanese, because that would use MeCab, not simple_tokenize.
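For illustration, here is a minimal sketch of the kind of edge case described above (this is not wordfreq's actual implementation). U+30FC, the katakana-hiragana prolonged sound mark, has the Unicode Script property Common rather than Katakana, so a character class built only from the Hiragana/Katakana/Han blocks misses it and splits a word at that character:

```python
import re

# Hedged sketch, not wordfreq's real code: treat each run of "spaceless
# script" characters as a single token. The naive class covers Hiragana,
# Katakana, and the main CJK ideograph block, but omits U+30FC (the
# prolonged sound mark), whose Unicode Script property is Common.
NAIVE_SPACELESS = r'[\u3040-\u309f\u30a1-\u30fa\u4e00-\u9fff]'
# Adding \u30fc closes the gap for this particular character.
FIXED_SPACELESS = r'[\u3040-\u309f\u30a1-\u30fa\u30fc\u4e00-\u9fff]'

def spaceless_runs(text, char_class):
    """Return each maximal run of spaceless-script characters as one token."""
    return re.findall(char_class + '+', text)

word = 'コーヒー'  # "coffee"
print(spaceless_runs(word, NAIVE_SPACELESS))  # ['コ', 'ヒ'] -- spurious boundaries
print(spaceless_runs(word, FIXED_SPACELESS))  # ['コーヒー'] -- one token
```

The fix in a pipeline like this amounts to extending the spaceless-script character class (or consulting Script_Extensions, which lists both Hiragana and Katakana for U+30FC) so such characters no longer break up a token.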

Tahnan commented 6 years ago

Looks good; but before I merge, do you want to also update the CHANGELOG?