rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

question about joint bpe vocab size #79

Closed. zrlhk closed this issue 4 years ago.

zrlhk commented 4 years ago

Hi, if I use joint BPE, for example training a joint BPE model with a BPE vocabulary size of 64000, which network vocabulary should I use:

1. source vocab 35000 + target vocab 35000 (slightly larger than 32000 + 32000), or
2. source vocab 70000 + target vocab 70000 (slightly larger than 64000 + 64000)?

Which one is right, and how should I choose the BPE size and the network vocabulary size?

rsennrich commented 4 years ago

If you tie the embeddings between the source and the target side, you'll want to go for option 2 (~70000).
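For the tied case, a minimal sketch (not from this thread; the file names and the 64k merge count are assumptions, and the pip-installed `subword-nmt` command-line entry points are assumed) of how you could measure the size of the shared vocabulary that the network then needs to cover:

```sh
# learn one joint BPE model on the concatenation of both sides (64k merges assumed)
cat train.src train.trg | subword-nmt learn-bpe -s 64000 -o codes.bpe

# apply it to both sides of the training data
subword-nmt apply-bpe -c codes.bpe < train.src > train.bpe.src
subword-nmt apply-bpe -c codes.bpe < train.trg > train.bpe.trg

# count the shared vocabulary over source + target together;
# its size (roughly 70000 in the example above) is what a tied network vocabulary should cover
cat train.bpe.src train.bpe.trg | subword-nmt get-vocab | wc -l
```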

If you learn separate embeddings, it's hard to predict how many of the BPE units you generate will only appear on the source side, only on the target side, or on both. You can use the script get_vocabulary.py to measure the vocabulary size on your training set.

By the way, I would then also recommend that you disallow any BPE units in your test set that you didn't see during training, since the model won't have learned what to do with them... see also https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt
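For reference, the linked README section describes roughly this workflow (a sketch; the file names, the 64k merge count, and the threshold of 50 are placeholders, and the `subword-nmt` entry points are assumed to be installed):

```sh
# learn joint BPE on both languages and write a per-language vocabulary for each side
subword-nmt learn-joint-bpe-and-vocab --input train.L1 train.L2 -s 64000 \
    -o codes.bpe --write-vocabulary vocab.L1 vocab.L2

# apply BPE to each side, restricting the output to units seen at least 50 times
# in that side's training data; rarer or unseen units are segmented further,
# so the test set contains no BPE units the model never saw during training
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.L1 --vocabulary-threshold 50 < train.L1 > train.BPE.L1
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.L1 --vocabulary-threshold 50 < test.L1 > test.BPE.L1
```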