surafelml opened this issue 6 years ago.
Try text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK") (and I think the UNK token must be explicitly included in the vocabulary).
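A rough sketch of where that encoder could be plugged in, assuming the vocabulary file sits in data_dir and the OOV token is literally spelled "UNK" (both of these are assumptions, adjust to your setup):

import os
from tensor2tensor.data_generators import text_encoder

# Inside your problem class: load the external BPE vocab as a plain token
# vocabulary and map anything out-of-vocabulary to the explicit UNK token.
def feature_encoders(self, data_dir):
  vocab_filename = os.path.join(data_dir, self.vocab_file)
  # "UNK" must appear as a line in the vocabulary file itself.
  encoder = text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK")
  return {"inputs": encoder, "targets": encoder}

Overriding feature_encoders keeps the rest of the problem unchanged while controlling exactly which encoder training and decoding use.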
Note that several people reported worse BLEU when using external BPE compared with the subwords implemented in T2T (of course, using the same vocab size in both experiments).
Also make sure to preprocess the content to be translated in the same way as you have preprocessed your training data, i.e. use the same tokenizer and apply the same BPE model (#219, #84, #10, #49).
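For example, segmenting the input with the same BPE model before decoding could look roughly like this (a sketch using subword-nmt's Python API; the codes file name is a placeholder):

from subword_nmt.apply_bpe import BPE

# The merge-codes file produced when the BPE model was learned (name assumed).
with open("bpe.codes", encoding="utf-8") as codes_file:
  bpe = BPE(codes_file)

# The line must already be tokenized with the same tokenizer used for training.
line = "this is an already tokenized test sentence"
print(bpe.process_line(line))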
@martinpopel @mehmedes Many thanks! I will go through these changes and get back to you shortly.
It now works at decoding time; however, the translations are completely wrong. A few points:
Here is the problem spec I am currently using:
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

EOS = text_encoder.EOS_ID

# _ENIT_TRAIN_DATASETS and _ENIT_TEST_DATASETS are defined elsewhere in this file.


@registry.register_problem
class TranslateEnitBpe8k(translate.TranslateProblem):

  @property
  def targeted_vocab_size(self):
    return 8000

  @property
  def vocab_name(self):
    return "vocab.enit"

  def generator(self, data_dir, tmp_dir, train):
    # get_or_generate_vocab builds (or loads) a T2T subword vocabulary here.
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _ENIT_TRAIN_DATASETS)
    datasets = _ENIT_TRAIN_DATASETS if train else _ENIT_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "enit_tok_%s" % tag)
    return translate.token_generator(data_path + ".lang1", data_path + ".lang2",
                                     symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def target_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def use_subword_tokenizer(self):
    return False


@registry.register_hparams
def translate_enit_bpe8k_hparams():
  hparams = transformer.transformer_base_single_gpu()
  hparams.batch_size = 4096
  hparams.max_length = 50
  return hparams
Any clue on what's going on?
Many Thanks!
I have trained a base Transformer model using the sub-word segmentation approach of Sennrich et al. (https://github.com/rsennrich/subword-nmt). This requires me to set use_subword_tokenizer to False in the new t2t problem definition.
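For reference, learning the BPE model with subword-nmt looks roughly like this (a sketch only; the file names and the 8k merge operations are placeholders rather than my exact setup):

from subword_nmt.learn_bpe import learn_bpe

# Learn 8000 merge operations on the concatenated, tokenized training data.
with open("train.tok.enit", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as outfile:
  learn_bpe(infile, outfile, 8000)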
However, after a successful training run I am hitting an error (traceback while decoding batch 0).
Many Thanks!