rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Unknown word and vocabulary filter #111

Closed Hannibal046 closed 2 years ago

Hannibal046 commented 2 years ago

Hi, thanks for the great work! After reading the code for the vocabulary filtering strategy, I have a question, and I would like to know whether there is a possible solution.

Suppose I have an En-De dataset. After running learn-bpe on the concatenation of both sides, I get a code.bpe file, with which I can segment both datasets.

But suppose the En dataset accidentally contains a Chinese word like 欲言又止 that appears only once. Even if I use vocabulary-threshold in apply-bpe, it will still be tokenized as 欲@@ 言@@ 又@@ 止, and it will then be included in the vocabulary generated by nematus/data/build_dictionary.py, which is definitely not what I want. What I want is 欲言又止 --> <UNK>.
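For reference, the commands I run look roughly like this (file names, the number of merge operations, and the threshold are just placeholders for my actual setup):

```bash
# learn a joint BPE model on the concatenation of both sides
cat train.en train.de | subword-nmt learn-bpe -s 32000 > code.bpe

# segment the training data and collect a per-language subword vocabulary
subword-nmt apply-bpe -c code.bpe < train.en | subword-nmt get-vocab > vocab.en

# re-apply BPE, filtering against the vocabulary with a frequency threshold
subword-nmt apply-bpe -c code.bpe --vocabulary vocab.en --vocabulary-threshold 50 \
    < train.en > train.bpe.en
```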

So I'm wondering whether it would be more reasonable to add a threshold option in build_dictionary.py as well. Thanks so much!

rsennrich commented 2 years ago

The dictionary resulting from build_dictionary.py is sorted by frequency, so you can easily just take the first N lines, or cut off M lines from the bottom, if you want to reduce the vocabulary size in some way. Note that this doesn't directly affect the splitting done by BPE (it is bottom-up, so a character-level representation like the one you saw is the default for an unknown script).
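For example, if the dictionary is stored as a plain text file with one entry per line (adjust the file names and the numbers to your setup), something like this would do:

```bash
# keep only the 50,000 most frequent entries
head -n 50000 dict.txt > dict.top50k.txt

# or, with GNU head, drop the 1,000 least frequent entries from the bottom
head -n -1000 dict.txt > dict.trimmed.txt
```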

apply_bpe.py has an argument --glossaries, with which you can define regular expressions for words that should be exempt from BPE segmentation. If you want to systematically ignore words in Chinese script, this should be possible.
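For example, something along these lines (untested; the character class is a rough sketch covering the CJK Unified Ideographs range, and the file names and threshold are placeholders):

```bash
# protect whole words in Chinese script from BPE segmentation, so they stay
# single rare tokens instead of being split into individual characters
subword-nmt apply-bpe -c code.bpe --vocabulary vocab.en --vocabulary-threshold 50 \
    --glossaries '[\u4e00-\u9fff]+' < train.en > train.bpe.en
```

An unsegmented token like 欲言又止 should then be rare enough to fall outside a truncated dictionary and be mapped to <UNK> during training.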

Hannibal046 commented 2 years ago

OK, got it! Thanks for your detailed reply.