rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

BPE-Dropout question #104

Closed by oaarnikoivu 3 years ago

oaarnikoivu commented 3 years ago

Hi,

I'm trying to implement BPE-dropout using the technique you mention in the README: creating an augmented training dataset by concatenating the original training dataset (5K sentences) multiple times, and then applying BPE-dropout to it. Do I have to run the "learn BPE" step on the concatenated dataset, or does it suffice to learn BPE on the original 5K sentences and then simply apply BPE with the dropout probability to the concatenated dataset, using the vocabulary learned on the original data?
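For context on what "applying BPE with a dropout probability" means: BPE-dropout (Provilkov et al.) keeps the learned merge table fixed and only randomizes segmentation at apply time, by stochastically skipping merges. This is a minimal pure-Python sketch, not the subword-nmt implementation; the function name `apply_bpe_dropout` and the toy merge list are illustrative:

```python
import random

def apply_bpe_dropout(word, merges, p=0.1, rng=random):
    """Apply a learned merge list to one word in priority order,
    skipping each candidate merge with probability p (BPE-dropout).
    With p=0 this reduces to standard, deterministic BPE."""
    symbols = list(word)
    for left, right in merges:  # merges in learned priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (left, right) and rng.random() >= p:
                symbols[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return symbols

# Toy merge table; real tables come from the learn-BPE step.
merges = [("w", "e"), ("l", "o"), ("lo", "we")]
print(apply_bpe_dropout("lower", merges, p=0.0))  # deterministic: ['lowe', 'r']
```

With p > 0, each pass over the concatenated copies of the corpus can yield a different segmentation of the same word, which is exactly the regularization effect the README describes.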

rsennrich commented 3 years ago

you closed this issue, so you might have found the answer yourself, but in case anybody else finds this question:

the training part of BPE (learn_bpe) depends only on which pair of symbols is most frequent at any given time (with alphabetical order to resolve ties), so it doesn't matter whether you run it on the original text or on a dataset that consists of 5 copies of the original text.