The dictionary resulting from `build_dictionary.py` is sorted by frequency, so you can easily take the first N lines, or cut off M lines from the bottom, if you want to reduce the vocabulary size. Note that this doesn't directly affect the splitting done by BPE (BPE is bottom-up, so a character-level representation like the one you saw is the default for unknown scripts).
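For example, a minimal sketch of trimming the dictionary to the top N entries, assuming the JSON output of `build_dictionary.py` maps each word to its frequency rank (the file names and N are illustrative):

```python
import json

# Load the frequency-sorted dictionary produced by build_dictionary.py
# (assumption: entries map word -> index, lower index = more frequent).
with open("train.bpe.en.json") as f:
    worddict = json.load(f)

N = 50000  # desired vocabulary size
trimmed = {word: idx for word, idx in worddict.items() if idx < N}

with open("train.bpe.en.trimmed.json", "w") as f:
    json.dump(trimmed, f, indent=2, ensure_ascii=False)
```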
`apply_bpe.py` has an argument `--glossaries`, with which you can define regular expressions for words that should be ignored by BPE. If you want to systematically ignore words in Chinese script, this should be possible.
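As a rough sketch of how this could look through the library interface (the CJK range regex, and whether your installed subword-nmt version treats glossary entries as regular expressions, are assumptions to verify):

```python
from subword_nmt.apply_bpe import BPE

# Keep whole runs of CJK characters out of BPE splitting.
# Assumption: glossaries entries are interpreted as regular expressions.
with open("code.bpe") as codes:
    bpe = BPE(codes, glossaries=[r"[\u4e00-\u9fff]+"])

print(bpe.process_line("an English sentence with 欲言又止 in it"))
```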
OK, I got it! Thanks for your detailed reply.
Hi, thanks for the great work! After reading the code for the vocabulary filtering strategy, I have a question and would like to know whether there is a possible solution.

Suppose I have an En-De dataset. After running `learn-bpe` on the concatenation of both sides, I get a `code.bpe` file with which I can segment both sides. But if the En side is accidentally mixed with a Chinese word like 欲言又止 that appears only once, then even if I use `--vocabulary-threshold` with `apply-bpe`, it will still be tokenized as 欲@@ 言@@ 又@@ 止, and those pieces will be included in the vocabulary generated by `nematus/data/build_dictionary.py`, which is definitely not what I want. What I want is 欲言又止 --> `<UNK>`.

So I'm wondering: would it be more reasonable to add a threshold option to `build_dictionary.py` as well? Thanks so much.
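Until something like that exists, here is a minimal sketch of the desired behaviour outside `build_dictionary.py`, assuming a plain-text BPE-segmented corpus and nematus-style JSON output (the file names, threshold value, and reserved-symbol names/indices are assumptions to match against your nematus version):

```python
import json
from collections import Counter

THRESHOLD = 5  # drop tokens seen fewer than this many times

# Count token frequencies in the BPE-segmented training corpus.
counts = Counter()
with open("train.bpe.en") as f:
    for line in f:
        counts.update(line.split())

# Reserved symbols; names and indices are assumptions, match your setup.
worddict = {"<EOS>": 0, "<UNK>": 1}
for word, count in counts.most_common():
    if count < THRESHOLD:
        break  # most_common is sorted, so everything after is rarer
    worddict[word] = len(worddict)

with open("train.bpe.en.json", "w") as f:
    json.dump(worddict, f, indent=2, ensure_ascii=False)
```

With such a filter, rare pieces like 欲@@ never enter the dictionary, so the model maps them to `<UNK>` at training time.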