rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

BPE vocabulary config #95

Closed · anrizal closed this issue 4 years ago

anrizal commented 4 years ago

Hello,

I am trying to replicate the paper Revisiting Low-Resource Neural Machine Translation: A Case Study. I am not quite sure about the term "BPE vocabulary" in relation to subword-nmt.

The paper mentions 30,000 BPE operations, which I assume corresponds to the number of merge operations (the `bpe_operations` parameter) in subword-nmt. How about the BPE vocabulary: does this correspond to the `--vocabulary-threshold` parameter in subword-nmt?

Thank you

rsennrich commented 4 years ago

You can find more documentation (and scripts) for this paper here: http://data.statmt.org/rsennrich/iwslt14_lowresource/

In short, we ran learn_joint_bpe_and_vocab.py with 30,000 merge operations, but used `--vocabulary-threshold` in apply_bpe.py to scale the vocabulary size to the respective amounts of training data.
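
For reference, a minimal sketch of that two-step workflow using the subword-nmt command line (which wraps learn_joint_bpe_and_vocab.py and apply_bpe.py). The file names and the threshold value 50 below are illustrative assumptions, not the paper's exact settings:

```bash
# Learn a joint BPE model with 30,000 merge operations and write
# per-language vocabularies (corpus.de/corpus.en are placeholder names).
subword-nmt learn-joint-bpe-and-vocab \
    --input corpus.de corpus.en -s 30000 -o bpe.codes \
    --write-vocabulary vocab.de vocab.en

# Apply the codes, keeping only subwords that occur at least 50 times
# in the training data (50 is an assumed value; the paper scales the
# threshold with the amount of training data).
subword-nmt apply-bpe -c bpe.codes \
    --vocabulary vocab.de --vocabulary-threshold 50 \
    < corpus.de > corpus.bpe.de
```

With this setup, the number of merge operations stays fixed at 30,000, while the effective vocabulary size is controlled at application time by the threshold: rare subwords below the cutoff are split further into smaller units.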