rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

learn_joint_bpe_and_vocab.py for Japanese #103

Closed lovodkin93 closed 2 years ago

lovodkin93 commented 3 years ago

Hello, I would like to use your learn_joint_bpe_and_vocab.py script to train a BPE tokenizer for Japanese. The problem is that the Japanese writing system, similarly to Mandarin, does not separate words with spaces and writes everything as one long sequence. So I was wondering whether this property of Japanese might affect the training of the BPE model, or whether it is supported somehow. Thank you!
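For reference (this sketch is added here and is not part of the original post), the single-language case the question is about could be set up with the package's Python API roughly as follows; the file names `train.ja` and `codes.ja` are hypothetical placeholders:

```python
# Minimal sketch: learn BPE merge operations for Japanese text with
# subword-nmt's Python API. File names are hypothetical placeholders.
import codecs

from subword_nmt import learn_bpe

with codecs.open('train.ja', encoding='utf-8') as infile, \
     codecs.open('codes.ja', 'w', encoding='utf-8') as outfile:
    # num_symbols is the number of merge operations to learn
    learn_bpe.learn_bpe(infile, outfile, num_symbols=16000, min_frequency=2)
```

The resulting codes file can then be loaded with `apply_bpe.BPE` to segment new text.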

rsennrich commented 3 years ago

BPE runs on unsegmented text in principle, but subword-nmt uses a caching strategy that is probably suboptimal for scripta continua. If speed or memory consumption is a problem in your setup, you can either:
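To make the caching point concrete: subword-nmt builds its frequency dictionary over whitespace-separated tokens, so in unsegmented text every distinct sentence becomes a single, very long dictionary entry. One illustrative workaround is to insert word boundaries with a morphological analyzer before learning BPE. This is a sketch under the assumption that pre-segmentation is acceptable; the third-party fugashi/MeCab tokenizer is not part of subword-nmt, and this is not necessarily among the options the reply goes on to list:

```python
# Illustrative sketch (not the maintainer's prescribed fix): pre-segment
# Japanese into space-separated words so that subword-nmt's word-frequency
# dictionary sees short word types instead of whole sentences.
# Assumes the third-party fugashi package with a MeCab dictionary
# installed (e.g. unidic-lite).
import codecs

import fugashi
from subword_nmt import learn_bpe

tagger = fugashi.Tagger()

with codecs.open('train.ja', encoding='utf-8') as raw, \
     codecs.open('train.segmented.ja', 'w', encoding='utf-8') as seg:
    for line in raw:
        # insert spaces at the morpheme boundaries found by MeCab
        words = (word.surface for word in tagger(line.strip()))
        seg.write(' '.join(words) + '\n')

with codecs.open('train.segmented.ja', encoding='utf-8') as infile, \
     codecs.open('codes.ja', 'w', encoding='utf-8') as outfile:
    learn_bpe.learn_bpe(infile, outfile, num_symbols=16000)
```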