tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Unable to decode a pre-processed input with sub-word segmentation #578

Open surafelml opened 6 years ago

surafelml commented 6 years ago

I have trained a base Transformer model using the sub-word segmentation approach of Sennrich et al. (https://github.com/rsennrich/subword-nmt). This requires disabling T2T's built-in subword tokenizer in the new problem definition:

  @property
  def use_subword_tokenizer(self):
    return False

However, after a successful training run, decoding fails with the following traceback (while decoding batch 0):

INFO:tensorflow: batch 47
INFO:tensorflow:Decoding batch 0
Traceback (most recent call last):
  File "/home/anaconda3/bin/t2t-decoder", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.4.2', 't2t-decoder')
...
  File "/home/anaconda3/lib/python3.6/site-packages/tensor2tensor-1.4.2-py3.6.egg/tensor2tensor/data_generators/text_encoder.py", line 257, in encode
    ret = [self._token_to_id[tok] for tok in tokens]
  File "/home/anaconda3/lib/python3.6/site-packages/tensor2tensor-1.4.2-py3.6.egg/tensor2tensor/data_generators/text_encoder.py", line 257, in <listcomp>
    ret = [self._token_to_id[tok] for tok in tokens]
KeyError: 'F@@'

Many Thanks!

martinpopel commented 6 years ago

Try text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK") (and I think the UNK token must be explicitly included in the vocabulary). Note that several people reported worse BLEU when using external BPE compared with the subwords implemented in T2T (of course, using the same vocab size in both experiments).
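For reference, a minimal sketch of how that could be wired into the problem class (illustrative only, not tested; it assumes the vocabulary file contains a literal UNK line, and the class name here is hypothetical):

import os

from tensor2tensor.data_generators import text_encoder, translate
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnitBpe8kOov(translate.TranslateProblem):
  """Sketch: replace out-of-vocabulary tokens with UNK at encode time."""

  @property
  def vocab_name(self):
    return "vocab.enit"

  def feature_encoders(self, data_dir):
    # TokenTextEncoder with replace_oov maps unknown tokens (e.g. 'F@@')
    # to UNK instead of raising KeyError. "UNK" must be a line in the
    # vocab file for this to work.
    vocab_filename = os.path.join(data_dir, self.vocab_file)
    encoder = text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK")
    return {"inputs": encoder, "targets": encoder}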

mehmedes commented 6 years ago

Also make sure to preprocess the content to be translated in the same way as you preprocessed your training data, i.e. use the same tokenizer and apply the same BPE model (#219, #84, #10, #49).
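As a concrete sketch (assuming subword-nmt is pip-installed; "codes.bpe" and the file names below are placeholders for your own merge file and tokenized input):

from subword_nmt.apply_bpe import BPE

# Segment the decode input with the same BPE merges learned on the training
# data, so the model only ever sees subword units it was trained on.
with open("codes.bpe", encoding="utf-8") as codes_file:
  bpe = BPE(codes_file)

with open("test.tok.en", encoding="utf-8") as fin, \
     open("test.bpe.en", "w", encoding="utf-8") as fout:
  for line in fin:
    fout.write(bpe.process_line(line))

Conversely, the model's output will still contain the "@@" separators, so it needs the usual BPE post-processing (joining the segments back together) before evaluation.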

surafelml commented 6 years ago

@martinpopel @mehmedes Many thanks! I will go through these changes and get back to you shortly.

surafelml commented 6 years ago

Decoding works now, but the translations are completely wrong. A few points:

Here is the problem spec I am currently using:

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

EOS = text_encoder.EOS_ID

# _ENIT_TRAIN_DATASETS / _ENIT_TEST_DATASETS are defined elsewhere in this file.


@registry.register_problem
class TranslateEnitBpe8k(translate.TranslateProblem):

  @property
  def targeted_vocab_size(self):
    return 8000

  @property
  def vocab_name(self):
    return "vocab.enit"

  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _ENIT_TRAIN_DATASETS)

    datasets = _ENIT_TRAIN_DATASETS if train else _ENIT_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "enit_tok_%s" % tag)

    return translate.token_generator(data_path + ".lang1", data_path + ".lang2",
                                     symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def target_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def use_subword_tokenizer(self):
    return False


# hparams sets are registered at module level, not inside the problem class.
@registry.register_hparams
def translate_enit_bpe8k_hparams():
  hparams = transformer.transformer_base_single_gpu()
  hparams.batch_size = 4096
  hparams.max_length = 50
  return hparams

Any clue on what's going on?

Many Thanks!