I want to use SentencePiece BPE as the segmentation algorithm for my NMT task. My corpus size is less than 100k, and the source and target languages are very distant.
Should I use a joint vocabulary or two separate vocabularies for the source and target?
And what should the size of the subword vocabulary be?
The optimal size depends on the training data, but 8k-32k is widely used across tasks. With a corpus under 100k sentences, the lower end of that range (or even smaller) is usually safer: rare merges learned from so little data are poorly estimated, and a smaller vocabulary also reduces embedding parameters. On joint vs. separate: a joint vocabulary is popular because it lets the model share subwords (and tie embeddings) across languages, but that mainly pays off when the languages share a script and surface forms. For very distant languages, especially with different scripts, there is little overlap to share, so two separate vocabularies are a reasonable choice.