tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

t2t_decoder hangs when dot_product_relative_v2 is used #1506

Closed vidavakil closed 5 years ago

vidavakil commented 5 years ago

Hello,

I am trying to train a custom Transformer model that has a decoder only (with a custom bottom['targets']), for sequence generation. I was able to train and generate from the model when I had not specified any other special params. However, the generated sequences frequently had a failure mode where certain tokens repeated too often.

I then added the following two params and am training a new model:

hparams.self_attention_type = "dot_product_relative_v2"
hparams.max_relative_position = 256
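For readers unfamiliar with max_relative_position, here is a minimal sketch of the general idea behind clipped relative positions: the offset between a query position and a key position is clipped to a fixed window, so every offset beyond the window shares one embedding. The function name relative_position_ids is hypothetical, and this shows only the generic clipping scheme; the exact indexing that tensor2tensor uses for dot_product_relative_v2 may differ.

```python
def relative_position_ids(length, max_relative_position):
    """Return a length x length matrix of clipped, shifted relative offsets.

    Hypothetical helper for illustration only; not a tensor2tensor API.
    """
    k = max_relative_position
    ids = []
    for i in range(length):          # query position
        row = []
        for j in range(length):      # key position
            offset = j - i           # negative means the key is in the past
            clipped = max(-k, min(k, offset))
            row.append(clipped + k)  # shift into the range [0, 2k]
        ids.append(row)
    return ids


if __name__ == "__main__":
    for row in relative_position_ids(4, 2):
        print(row)
```

Under this scheme a window of k yields at most 2k + 1 distinct embedding indices, so max_relative_position = 256 would give 513.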

However, now when I run t2t_decoder, it hangs and produces no output. It is also hard to kill with ^C, so I have to use kill -9. I run the decoder in interactive mode and simply press Return at the '>' prompt.

t2t_decoder \
  --data_dir="${DATA_DIR}" \
  --decode_hparams="${DECODE_HPARAMS}" \
  --decode_interactive \
  --hparams="sampling_method=random" \
  --hparams_set=${HPARAMS_SET} \
  --model=${MODEL} \
  --problem=${PROBLEM} \
  --output_dir=${TRAIN_DIR}

where:

DECODE_HPARAMS="alpha=0,beam_size=1,extra_length=2048"
MODEL=transformer

OS: macOS, High Sierra

$ pip freeze | grep tensor
Error [Errno 20] Not a directory: '/Users/vida_vakil/miniconda3/lib/python3.6/site-packages/magenta-1.0.2-py3.6.egg' while executing command git rev-parse
Exception: ....
NotADirectoryError: [Errno 20] Not a directory: '/Users/vida_vakil/miniconda3/lib/python3.6/site-packages/magenta-1.0.2-py3.6.egg'

The model I am using is based on Score2Perf (https://github.com/tensorflow/magenta/tree/master/magenta/models/score2perf). I installed it following the instructions on that page and here: https://github.com/tensorflow/magenta. The pip freeze error above appears to be related to the magenta egg install.

$ python -V
Python 3.6.6 :: Anaconda, Inc.

tensorflow 1.12.0
tensor2tensor 1.13.0

Thanks in advance

vidavakil commented 5 years ago

More specifically, the job hangs after printing the following:

Restoring parameters from /.../model.ckpt-3000
INFO:tensorflow:Running local_init_op.
... tf_logging.py:115] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
... tf_logging.py:115] Done running local_init_op.

vidavakil commented 5 years ago

I was able to generate from the model after all. It took much longer than without relative attention, and I also had to stop the training job to free up resources; otherwise both jobs appeared to hang for hours.
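One possible explanation for the slowdown, stated as an assumption and not confirmed in this thread: if decoding with this attention type cannot reuse cached states from earlier steps, each new token requires re-running the model over the whole sequence generated so far. The sketch below counts how many token positions are processed in each case; the function name positions_processed is hypothetical, and it ignores any input prefix.

```python
def positions_processed(num_steps, cached):
    """Count token positions the model processes while decoding num_steps tokens.

    Hypothetical cost model for illustration; not measured from tensor2tensor.
    """
    if cached:
        # With cached states, each step processes only the newest position.
        return num_steps
    # Without caching, step t re-processes all t positions generated so far.
    return sum(range(1, num_steps + 1))


if __name__ == "__main__":
    steps = 2048  # matches extra_length=2048 in DECODE_HPARAMS
    print("cached:  ", positions_processed(steps, cached=True))
    print("uncached:", positions_processed(steps, cached=False))
```

With extra_length=2048 this simple model gives 2048 positions when caching is available and 2,098,176 without it, roughly a thousandfold difference, which would be consistent with a run that looks hung but eventually finishes.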