rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

About the vocabulary size #108

Closed · TomasAndersonFang closed this issue 2 years ago

TomasAndersonFang commented 2 years ago

Hello Rico Sennrich,

While reading your paper "Neural Machine Translation of Rare Words with Subword Units", I noticed the statement "the final symbol vocabulary size is equal to the size of the initial vocabulary, plus the number of merge operations". But when I work through your toy example with the number of merge operations set to 1, I run into a problem either way I read "initial vocabulary":

1. If it means the symbol vocabulary, the symbol vocabulary after merging has the same size as before. In the first merge operation, 'es' has the highest frequency, so BPE merges 'e' and 's'; when we recount the symbol vocabulary, 's' is removed and 'es' is added, so the size stays constant instead of growing by one (see the sketch below).
2. If it means the word vocabulary, the size of the symbol vocabulary is clearly not 5 (4 words plus 1 merge operation).
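To make the count concrete, here is a minimal sketch of how I understand the toy example (the word frequencies are the ones from the paper; the helper functions are my own, not code from subword-nmt):

```python
from collections import Counter
import re

# Toy dictionary from the paper: each word split into characters,
# with '</w>' marking the end of a word.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

def pair_stats(vocab):
    """Frequency of each adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

def symbols(vocab):
    """The set of symbols currently in use."""
    return {s for word in vocab for s in word.split()}

print(len(symbols(vocab)))        # 11 initial symbols (10 characters + '</w>')
stats = pair_stats(vocab)
best = max(stats, key=stats.get)  # ('e', 's'); ties at frequency 9 are broken
                                  # by first occurrence
vocab = merge_pair(best, vocab)
print(len(symbols(vocab)))        # still 11: 'es' was added, but 's' vanished
```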

This has been confusing me for days, so could you walk me through how to calculate the vocabulary size on your toy example?

Thank you very much!!!

rsennrich commented 2 years ago

Strictly speaking, the statement is only true under the assumption that no character (or other subword unit) ever has all of its instances merged into larger subword units. In practice this can happen, but it is very rare with a normal-scale training set: there will be many occurrences of 's' that do not follow 'e', so 's' will not be removed from the vocabulary.
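For instance, reusing `merge_pair()` and `symbols()` from the sketch above, and adding one made-up word in which 's' does not follow 'e' ('b u s </w>' is my own invention, not from the paper), the count behaves as the paper states:

```python
# Same toy dictionary plus one hypothetical word, 'bus', in which
# 's' does not follow 'e'.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
    'b u s </w>': 1,
}
print(len(symbols(vocab)))             # 13 initial symbols ('b' and 'u' are new)
vocab = merge_pair(('e', 's'), vocab)  # apply the same merge as before
print(len(symbols(vocab)))             # 14: 'es' is added and 's' survives in 'bus'
```

Here 13 initial symbols plus 1 merge operation gives 14, as expected.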

TomasAndersonFang commented 2 years ago

Thanks for your reply! That clears up all my confusion~