The tokenizer for the multilingual models puts whitespace around Chinese characters (Kanji), but this treatment unintentionally breaks Japanese words that consist of Hiragana and Kanji.
Many Japanese words (especially verbs and adjectives) are written as a combination of Hiragana and Kanji. It is fine for a subword algorithm to split them into subwords based on corpus statistics, but this "forced" segmentation would have a negative impact.
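A minimal sketch of the behavior, assuming the Hugging Face `transformers` library and the `bert-base-multilingual-cased` checkpoint (neither is named above, they are stand-ins for "the multilingual models"):

```python
from transformers import BertTokenizer

# Assumption: bert-base-multilingual-cased as an example multilingual tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# 食べる ("to eat") is a single word: the Kanji 食 plus the Hiragana ending べる.
print(tokenizer.tokenize("食べる"))
# The basic tokenization step surrounds every CJK ideograph with spaces,
# so the Kanji 食 is always separated from べる before subword segmentation,
# yielding something like ['食', 'べ', 'る'] rather than a split learned
# over the whole word.
```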
Thank you for the great work.