rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Question about vocabulary filter #110

Closed. Hannibal046 closed this issue 2 years ago

Hannibal046 commented 2 years ago

Hi, I am confused about the usage of `subword-nmt learn-joint-bpe-and-vocab`. What edge case does the vocabulary filter handle when using joint BPE? Since the BPE segmentation is undone for all words at test time, why could this produce unknown words? Thanks for answering.

Hannibal046 commented 2 years ago

And what is the actual vocabulary for training the DL model? Should I use the vocabulary file generated by subword-nmt, taking each line as a vocabulary term? Or should I take the BPE-segmented file and split it on spaces to build the vocabulary myself?
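To make the two options concrete, here is a minimal sketch of both. It assumes each line of the vocabulary file has the form `<token> <count>` and that subwords carry the default `@@` continuation marker. The helper names and the sample data are hypothetical, not part of subword-nmt.

```python
from collections import Counter


def vocab_from_segmented(lines):
    """Option 2: count whitespace-separated tokens in BPE-segmented text."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocab_from_vocab_file(lines):
    """Option 1: parse lines assumed to look like '<token> <count>'."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token, count = line.rsplit(" ", 1)
        counts[token] = int(count)
    return counts


# Hypothetical BPE-segmented training text and a matching vocabulary file.
segmented = ["the low@@ er tow@@ er", "the low tow@@ er"]
vocab_lines = ["the 2", "er 3", "tow@@ 2", "low@@ 1", "low 1"]

from_text = vocab_from_segmented(segmented)
from_file = vocab_from_vocab_file(vocab_lines)

# On this toy data both routes give the same token counts.
print(dict(from_text) == from_file)
```

On this toy data the two routes agree. Whether they agree on real data depends on whether the vocabulary file was computed from the same segmented text that the model is trained on.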

Hannibal046 commented 2 years ago

And if I build my vocabulary by splitting the BPE-segmented file on spaces, I don't understand why an <UNK> token is necessary here. Sorry for taking your time.
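The situation being asked about can be sketched as follows. The vocabulary is built only from segmented training text, and a segmented test sentence is then mapped to ids. The segmentations are hypothetical and were not produced by subword-nmt; the point is only that any token absent from the training text needs some fallback id.

```python
UNK = "<UNK>"


def build_index(train_lines):
    """Assign an id to every whitespace-separated token seen in training."""
    vocab = {UNK: 0}
    for line in train_lines:
        for tok in line.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(line, vocab):
    """Map tokens to ids, falling back to the <UNK> id for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in line.split()]


# Hypothetical segmentations: 'est' never occurs in the training text.
train = ["the low@@ er tow@@ er"]
test = "the low@@ est tow@@ er"

vocab = build_index(train)
ids = encode(test, vocab)
print(ids)  # the id 0 marks the token that was not in the training vocabulary
```

Without the `<UNK>` entry, the lookup for `est` would have nothing to return.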