google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to decide the size of subword vocabulary? #557

Closed thilakshiK closed 3 years ago

thilakshiK commented 3 years ago

I want to use SentencePiece BPE as the segmentation algorithm for my NMT task. My corpus size is less than 100k. Also, the source and target languages are very distant. How should I decide the subword vocabulary size?

Thank You

taku910 commented 3 years ago

The optimal size totally depends on the training data. However, 8k-32k is widely used in many tasks. In addition, a joint vocabulary (a single model trained on the concatenated source and target data) is becoming popular these days.