google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Japanese words consist of Hiragana and Chinese characters (Kanji) #133

Closed: taku910 closed this issue 5 years ago

taku910 commented 5 years ago

Thank you for the great work.

The tokenizer for the multilingual models puts whitespace around every Chinese character (Kanji), but this treatment unintentionally breaks Japanese words that consist of both Hiragana and Kanji.

Many Japanese words (especially verbs and adjectives) are written as a combination of Hiragana and Kanji. It is fine for any subword algorithm to split them into subwords according to its own statistics, but "forced" segmentation would have a negative impact.
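
To make the problem concrete, here is a minimal sketch of the whitespace-padding step, not the repository's actual tokenizer code. The function names `is_cjk_ideograph`, `pad_cjk`, and `pre_split` are hypothetical, and the sketch assumes that checking only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough for illustration.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Return True if ch is in the main CJK Unified Ideographs block.

    Assumption: only U+4E00..U+9FFF is checked here; a real tokenizer
    would likely cover the extension blocks as well.
    """
    return 0x4E00 <= ord(ch) <= 0x9FFF


def pad_cjk(text: str) -> str:
    """Surround every Kanji with spaces; leave Hiragana and others as-is."""
    out = []
    for ch in text:
        if is_cjk_ideograph(ch):
            out.extend([" ", ch, " "])
        else:
            out.append(ch)
    return "".join(out)


def pre_split(text: str) -> list[str]:
    """Whitespace-split the padded text, as a pre-tokenization step would."""
    return pad_cjk(text).split()


if __name__ == "__main__":
    # "食べる" (to eat) and "美しい" (beautiful) are single words that mix
    # one Kanji with trailing Hiragana.
    for word in ["食べる", "美しい"]:
        print(word, pre_split(word))
```

Under these assumptions, the verb 食べる is always cut into 食 and べる before any subword model sees it, so a vocabulary entry covering the whole word can never be matched.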

jacobdevlin-google commented 5 years ago

See #138