tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Unable to decode a pre-processed input with sub-word segmentation #578

Open surafelml opened 6 years ago

surafelml commented 6 years ago

I have trained a base Transformer model using the sub-word segmentation approach of Sennrich et al. (https://github.com/rsennrich/subword-nmt). This requires disabling T2T's built-in subword tokenizer in the new problem definition:

  @property
  def use_subword_tokenizer(self):
    return False

However, after a successful training run, decoding fails with the following traceback (while decoding batch 0):

INFO:tensorflow: batch 47
INFO:tensorflow:Decoding batch 0
Traceback (most recent call last):
  File "/home/anaconda3/bin/t2t-decoder", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.4.2', 't2t-decoder')
...
  File "/home/anaconda3/lib/python3.6/site-packages/tensor2tensor-1.4.2-py3.6.egg/tensor2tensor/data_generators/text_encoder.py", line 257, in encode
    ret = [self._token_to_id[tok] for tok in tokens]
  File "/home/anaconda3/lib/python3.6/site-packages/tensor2tensor-1.4.2-py3.6.egg/tensor2tensor/data_generators/text_encoder.py", line 257, in <listcomp>
    ret = [self._token_to_id[tok] for tok in tokens]
KeyError: 'F@@'

Many Thanks!

martinpopel commented 6 years ago

Try text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK") (and I think the UNK token must be explicitly included in the vocabulary). Note that several people reported worse BLEU when using external BPE compared with the subwords implemented in T2T (of course, using the same vocab size in both experiments).
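For reference, a minimal sketch of how that could be wired into the problem class (illustrative only, not tested; it assumes the vocabulary file contains a literal UNK line, and the class name here is hypothetical):

import os

from tensor2tensor.data_generators import text_encoder, translate
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnitBpe8kOov(translate.TranslateProblem):
  """Sketch: replace out-of-vocabulary tokens with UNK at encode time."""

  @property
  def vocab_name(self):
    return "vocab.enit"

  def feature_encoders(self, data_dir):
    # TokenTextEncoder with replace_oov maps unknown tokens (e.g. 'F@@')
    # to UNK instead of raising KeyError. "UNK" must be a line in the
    # vocab file for this to work.
    vocab_filename = os.path.join(data_dir, self.vocab_file)
    encoder = text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK")
    return {"inputs": encoder, "targets": encoder}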

mehmedes commented 6 years ago

Also make sure to preprocess the content to be translated in the same way as you preprocessed your training data, i.e. use the same tokenizer and apply the same BPE model (#219, #84, #10, #49).
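As a concrete sketch (assuming subword-nmt is pip-installed; "codes.bpe" and the file names below are placeholders for your own merge file and tokenized input):

from subword_nmt.apply_bpe import BPE

# Segment the decode input with the same BPE merges learned on the training
# data, so the model only ever sees subword units it was trained on.
with open("codes.bpe", encoding="utf-8") as codes_file:
  bpe = BPE(codes_file)

with open("test.tok.en", encoding="utf-8") as fin, \
     open("test.bpe.en", "w", encoding="utf-8") as fout:
  for line in fin:
    fout.write(bpe.process_line(line))

Conversely, the model's output will still contain the "@@" separators, so it needs the usual BPE post-processing (joining the segments back together) before evaluation.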

surafelml commented 6 years ago

@martinpopel @mehmedes Many thanks! I will go through these changes and get back to you shortly.

surafelml commented 6 years ago

Decoding works now, but the translations are completely wrong. A few points:

Here is the problem spec I am currently using:

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

EOS = text_encoder.EOS_ID

# _ENIT_TRAIN_DATASETS / _ENIT_TEST_DATASETS are defined elsewhere in this file.


@registry.register_problem
class TranslateEnitBpe8k(translate.TranslateProblem):

  @property
  def targeted_vocab_size(self):
    return 8000

  @property
  def vocab_name(self):
    return "vocab.enit"

  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _ENIT_TRAIN_DATASETS)

    datasets = _ENIT_TRAIN_DATASETS if train else _ENIT_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "enit_tok_%s" % tag)

    return translate.token_generator(data_path + ".lang1", data_path + ".lang2",
                                     symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def target_space_id(self):
    return problem.SpaceID.GENERIC

  @property
  def use_subword_tokenizer(self):
    return False


# hparams sets are registered at module level, not inside the problem class.
@registry.register_hparams
def translate_enit_bpe8k_hparams():
  hparams = transformer.transformer_base_single_gpu()
  hparams.batch_size = 4096
  hparams.max_length = 50
  return hparams

Any clue on what's going on?

Many Thanks!