rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

For languages that do not share an alphabet, like Chinese and English, should I train a shared BPE model or train a separate BPE model for each language? #75

Closed: luckysofia closed this issue 5 years ago

luckysofia commented 5 years ago

@rsennrich

rsennrich commented 5 years ago

For Russian, we have used transliteration in the past to still train a shared BPE model and get more consistent segmentation.
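
A minimal sketch of that idea, assuming the third-party `unidecode` package for romanization (the original experiments may have used a different transliteration scheme); file names and the merge count are placeholders:

```python
import codecs

from unidecode import unidecode          # assumed romanization helper, not part of subword-nmt
from subword_nmt.learn_bpe import learn_bpe

# Romanize the Russian side so that it shares a (Latin) alphabet with English.
with codecs.open("train.ru", encoding="utf-8") as f_in, \
     codecs.open("train.ru.latin", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(unidecode(line))

# Concatenate the romanized Russian data with the English data ...
with codecs.open("train.joint", "w", encoding="utf-8") as f_out:
    for path in ("train.ru.latin", "train.en"):
        with codecs.open(path, encoding="utf-8") as f_in:
            f_out.write(f_in.read())

# ... and learn a single, shared set of BPE merge operations on the result.
with codecs.open("train.joint", encoding="utf-8") as f_in, \
     codecs.open("codes.joint", "w", encoding="utf-8") as f_out:
    learn_bpe(f_in, f_out, 32000)
```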

For Chinese, we typically train separate models, since there's no advantage to sharing the vocabulary, and there's a slight efficiency cost: you either use a larger vocabulary, or the resulting segmentation becomes more aggressive and produces longer sequences. If there's some good reason why you want to share vocabularies, for example to train a multilingual model, you can, and other people have done so.
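
In code, training separate models simply means learning one set of merge operations per language. A minimal sketch with the subword_nmt Python API, assuming the Chinese side has already been word-segmented (BPE operates on whitespace-tokenized input); file names and the merge count are placeholders:

```python
import codecs

from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

# Learn an independent set of merge operations for each language.
for lang in ("zh", "en"):
    with codecs.open("train." + lang, encoding="utf-8") as f_in, \
         codecs.open("codes." + lang, "w", encoding="utf-8") as f_out:
        learn_bpe(f_in, f_out, 32000)

# Apply the language-specific codes to the matching side of the corpus.
with codecs.open("codes.zh", encoding="utf-8") as codes:
    bpe_zh = BPE(codes)
print(bpe_zh.process_line("这 是 一 个 测试"))   # segmented with the Chinese-only merges
```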

echan00 commented 4 years ago

@rsennrich do you think it is a good idea to share the vocabulary if the texts to be segmented may contain small bits of the other language? Usually named entities.

rsennrich commented 4 years ago

Sure, you can also share vocabularies across different alphabets. One thing you should keep in mind is that this might result in seeing input tokens at test time that you've never seen (as input) at training time, which will likely produce garbage.

I discuss here how to prevent that while still sharing BPE operations. If you want the vocabulary to be shared (e.g. for embedding tying and/or multilingual models), you need to consider for yourself whether this could happen (i.e. seeing input tokens at test time that were unseen at training time), and how you want to deal with it.
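
subword-nmt supports this pattern directly: learn the merge operations on the joint corpus, extract a per-language symbol vocabulary from the BPE-segmented training data, and pass that vocabulary back when applying BPE, so that merges producing symbols rare or unseen in a given language's training data are undone. A minimal Python sketch; file names, the merge count, and the frequency threshold of 50 are placeholders:

```python
import codecs
from collections import Counter

from subword_nmt.apply_bpe import BPE, read_vocabulary
from subword_nmt.learn_bpe import learn_bpe

# 1. Learn a single set of merge operations on the concatenated corpus.
with codecs.open("train.joint", "w", encoding="utf-8") as f_out:
    for path in ("train.zh", "train.en"):
        with codecs.open(path, encoding="utf-8") as f_in:
            f_out.write(f_in.read())
with codecs.open("train.joint", encoding="utf-8") as f_in, \
     codecs.open("codes.joint", "w", encoding="utf-8") as f_out:
    learn_bpe(f_in, f_out, 32000)

with codecs.open("codes.joint", encoding="utf-8") as codes:
    joint_bpe = BPE(codes)

# 2. For each language, count which BPE symbols actually occur in its training data.
for lang in ("zh", "en"):
    counts = Counter()
    with codecs.open("train." + lang, encoding="utf-8") as f_in:
        for line in f_in:
            counts.update(joint_bpe.process_line(line).split())
    with codecs.open("vocab." + lang, "w", encoding="utf-8") as f_out:
        for symbol, count in counts.most_common():
            f_out.write("{0} {1}\n".format(symbol, count))

# 3. Re-apply the shared codes with the per-language vocabulary as a filter, so the
#    segmented output only contains symbols seen (frequently enough) for that language.
for lang in ("zh", "en"):
    with codecs.open("vocab." + lang, encoding="utf-8") as vocab_file:
        vocab = read_vocabulary(vocab_file, 50)
    with codecs.open("codes.joint", encoding="utf-8") as codes:
        bpe_lang = BPE(codes, vocab=vocab)
    with codecs.open("train." + lang, encoding="utf-8") as f_in, \
         codecs.open("train.bpe." + lang, "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(bpe_lang.process_line(line))
```

On the command line, the same workflow is provided by `subword-nmt learn-joint-bpe-and-vocab` followed by `subword-nmt apply-bpe --vocabulary ... --vocabulary-threshold ...`.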