Hello,
I would like to use your learn_joint_bpe_and_vocab.py script to train a BPE tokenizer for Japanese.
The problem is that Japanese script, similarly to Mandarin, does not separate words with spaces and writes everything as one long sequence.
So I was wondering whether this property of Japanese might affect the training of the BPE model, or whether it is supported somehow.
Thank you!
BPE runs on unsegmented text in principle, but subword-nmt uses a caching strategy that is probably suboptimal for scripta continua. If speed or memory consumption is a problem in your setup, you can either:
- use an alternative C++ or Rust implementation of BPE, such as YouTokenToMe, fastBPE, Hugging Face tokenizers, or SentencePiece (I have not tested how well they support scripta continua); or
- first do word segmentation for Japanese, e.g. with MeCab (see the sketch below).
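
For the second option, here is a minimal sketch of pre-segmenting a Japanese corpus with MeCab before BPE training, assuming the mecab-python3 package and a MeCab dictionary are installed; the file names corpus.ja and corpus.ja.seg are hypothetical placeholders:

```python
# Minimal sketch: segment raw Japanese text with MeCab so that BPE
# training sees space-separated words instead of one long sequence.
# Assumes mecab-python3 and a MeCab dictionary are installed; the
# file names "corpus.ja" and "corpus.ja.seg" are hypothetical.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati mode: tokens joined by spaces

with open("corpus.ja", encoding="utf-8") as src, \
        open("corpus.ja.seg", "w", encoding="utf-8") as dst:
    for line in src:
        # parse() returns the segmented line with trailing whitespace
        dst.write(tagger.parse(line.rstrip("\n")).strip() + "\n")
```

The segmented file can then go through the usual pipeline, e.g. `python learn_joint_bpe_and_vocab.py --input corpus.ja.seg -s 32000 -o codes.bpe --write-vocabulary vocab.ja` (the symbol count here is just a placeholder).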