tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0
training stuck after the 1st step (quite odd) #1705

Closed: ZihengZZH closed this issue 5 years ago

ZihengZZH commented 5 years ago

Description

I was running a simple text-only translation experiment on Multi30K (EN2DE) using model transformer and hparams_set transformer_base. It went well for some time, but then training suddenly stopped after the first training step and could not proceed.

Environment information

OS: Ubuntu 18.04 VM

$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0

$ python -V
Python 3.7.4
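
The `pip freeze` output above lists `tensorflow` but no `tensorflow-gpu`. For TensorFlow 1.15 and older, the CPU and GPU builds were published as separate pip packages, so this is worth checking when training appears to fall back to CPU. A minimal sketch of that check follows; `installed_tf_packages` is a hypothetical helper written for this illustration, not part of pip or TensorFlow.

```python
def installed_tf_packages(pip_freeze_text):
    """Map package name -> version for tensorflow / tensorflow-gpu lines.

    Hypothetical helper: it only parses `pip freeze` style text.
    """
    found = {}
    for line in pip_freeze_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() in ("tensorflow", "tensorflow-gpu"):
            found[name.lower()] = version
    return found


# The output reported in this issue.
freeze = """tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0"""

packages = installed_tf_packages(freeze)
has_gpu_build = "tensorflow-gpu" in packages
print(packages, has_gpu_build)
```

If `has_gpu_build` is false on a machine that is expected to train on a GPU, the installed package is a likely place to start looking.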

For bugs: reproduction and error logs

# Steps to reproduce:
...
It's a simple text-only translation task and it went well for a couple of days. When I set the hparam ```ema``` to True and ran the training, it would not proceed. So I reset ```ema``` to False and reran the training, and the following error occurred.
# Error logs:
...
I0920 09:36:22.023988 140220529194816 session_manager.py:500] Running local_init_op.
I0920 09:36:22.339501 140220529194816 session_manager.py:502] Done running local_init_op.
I0920 09:36:35.505281 140220529194816 basic_session_run_hooks.py:606] Saving checkpoints for 0 into ./t2t-output/ende/model.ckpt.
2019-09-20 09:37:01.837617: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 112 of 512
2019-09-20 09:37:11.833300: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 248 of 512
2019-09-20 09:37:21.829728: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 369 of 512
2019-09-20 09:37:31.819060: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 480 of 512
2019-09-20 09:37:34.421255: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
I0920 09:38:26.924503 140220529194816 basic_session_run_hooks.py:262] loss = 8.342694, step = 1
<THEN NOTHING GOES ON>

ZihengZZH commented 5 years ago

It seems the TF-GPU setup is misconfigured, because I noticed this error message:

Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0
  /job:localhost/replica:0/task:0/device:XLA_CPU:0].

Reconfiguring the cuDNN + tensorflow-gpu environment could solve the problem.
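
The candidate device list in the message above names only `CPU` and `XLA_CPU`, which suggests TensorFlow did not see a GPU at all. A rough sketch of how to confirm that is shown below. `candidate_device_types` and `local_device_types` are hypothetical helper names for this illustration. `device_lib.list_local_devices()` is an existing TensorFlow function, and it is imported lazily so the log-parsing part runs without TensorFlow installed.

```python
import re


def candidate_device_types(message):
    """Return the set of device types named in a TF placement message."""
    return set(re.findall(r"/device:([A-Za-z_]+):\d+", message))


def local_device_types():
    """Device types TensorFlow itself reports (requires TensorFlow)."""
    from tensorflow.python.client import device_lib
    return {d.device_type for d in device_lib.list_local_devices()}


# The device list quoted in this issue.
log = """Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0
  /job:localhost/replica:0/task:0/device:XLA_CPU:0]."""

types = candidate_device_types(log)
gpu_listed = any(t.endswith("GPU") for t in types)
print(sorted(types), gpu_listed)
```

Calling `local_device_types()` inside the training environment gives the same kind of answer directly from TensorFlow. If no GPU type shows up there either, the CUDA/cuDNN installation or the installed TensorFlow package is the likely culprit rather than the model or hparams.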