Hello,
I would like to use your learn_joint_bpe_and_vocab.py script to train a BPE tokenizer for Japanese.
The problem is that Japanese script, similarly to Mandarin, does not separate words with spaces and writes everything as one long sequence.
So I was wondering whether this property of Japanese might affect the training of the BPE model, or whether it is supported somehow.
Thank you!
BPE runs on unsegmented text in principle, but subword-nmt uses a caching strategy that is probably suboptimal for scripta continua. If speed or memory consumption is a problem in your setup, you can either:
- use an alternative C++ or Rust implementation of BPE, such as YouTokenToMe, fastBPE, Hugging Face tokenizers, or SentencePiece (I have not tested how well they support scripta continua); or
- first do word segmentation for Japanese, e.g. with MeCab (see the sketch below).
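
For the second option, here is a minimal sketch of pre-segmenting a Japanese corpus with MeCab before BPE training, assuming the mecab-python3 package and a MeCab dictionary are installed; the file names corpus.ja and corpus.ja.seg are hypothetical placeholders:

```python
# Minimal sketch: segment raw Japanese text with MeCab so that BPE
# training sees space-separated words instead of one long sequence.
# Assumes mecab-python3 and a MeCab dictionary are installed; the
# file names "corpus.ja" and "corpus.ja.seg" are hypothetical.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati mode: tokens joined by spaces

with open("corpus.ja", encoding="utf-8") as src, \
        open("corpus.ja.seg", "w", encoding="utf-8") as dst:
    for line in src:
        # parse() returns the segmented line with trailing whitespace
        dst.write(tagger.parse(line.rstrip("\n")).strip() + "\n")
```

The segmented file can then go through the usual pipeline, e.g. `python learn_joint_bpe_and_vocab.py --input corpus.ja.seg -s 32000 -o codes.bpe --write-vocabulary vocab.ja` (the symbol count here is just a placeholder).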