rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

question about joint bpe vocab size #79

Closed. zrlhk closed this issue 4 years ago.

zrlhk commented 4 years ago

Hi, if I use joint BPE, for example training a joint BPE model with a BPE vocabulary size of 64000, which network vocabulary should I use:

1. source vocab 35000 + target vocab 35000 (slightly larger than 32000 + 32000), or
2. source vocab 70000 + target vocab 70000 (slightly larger than 64000 + 64000)?

Which one is right, and how should I choose the BPE size and the network vocabulary size?

rsennrich commented 4 years ago

If you tie the embeddings between the source and the target side, you'll want to go for option 2 (~70000).
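For the tied case, a minimal sketch (not from this thread; the file names and the 64k merge count are assumptions, and the pip-installed `subword-nmt` command-line entry points are assumed) of how you could measure the size of the shared vocabulary that the network then needs to cover:

```sh
# learn one joint BPE model on the concatenation of both sides (64k merges assumed)
cat train.src train.trg | subword-nmt learn-bpe -s 64000 -o codes.bpe

# apply it to both sides of the training data
subword-nmt apply-bpe -c codes.bpe < train.src > train.bpe.src
subword-nmt apply-bpe -c codes.bpe < train.trg > train.bpe.trg

# count the shared vocabulary over source + target together;
# its size (roughly 70000 in the example above) is what a tied network vocabulary should cover
cat train.bpe.src train.bpe.trg | subword-nmt get-vocab | wc -l
```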

If you learn separate embeddings, it's hard to predict how many of the BPE units you generate will only appear on the source side, only on the target side, or on both. You can use the script get_vocabulary.py to measure the vocabulary size on your training set.

By the way, I would then also recommend that you disallow any BPE units in your test set that you didn't see during training, since the model won't have learned what to do with them... see also https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt
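For reference, the linked README section describes roughly this workflow (a sketch; the file names, the 64k merge count, and the threshold of 50 are placeholders, and the `subword-nmt` entry points are assumed to be installed):

```sh
# learn joint BPE on both languages and write a per-language vocabulary for each side
subword-nmt learn-joint-bpe-and-vocab --input train.L1 train.L2 -s 64000 \
    -o codes.bpe --write-vocabulary vocab.L1 vocab.L2

# apply BPE to each side, restricting the output to units seen at least 50 times
# in that side's training data; rarer or unseen units are segmented further,
# so the test set contains no BPE units the model never saw during training
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.L1 --vocabulary-threshold 50 < train.L1 > train.BPE.L1
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.L1 --vocabulary-threshold 50 < test.L1 > test.BPE.L1
```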