I suspect this is because your vocab_size declared in problem_hparams is smaller than the actual number of tokens in your vocabulary. If you hit a token with an id at or above your vocab_size, you'll get a NaN on GPU; see here:
https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits
"labels: Tensor of shape [d_0, d_1, ..., d_{r-1}] (where r is rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU."
It's annoying that it gives a NaN and not an error, but the TF folks say it's hard to propagate errors from CUDA on GPU. In any case, try increasing your softmax size, like here: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem_hparams.py#L422
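If you want to see the failure mode in isolation, here is a minimal sketch (assuming TF 1.x and a visible GPU; the logits and label values are made up) showing how an out-of-range label yields NaN on GPU instead of an error:

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.5]])  # num_classes = 3
labels = tf.constant([5])                # 5 >= num_classes, so it is invalid

# Out-of-range labels raise an error on CPU but silently produce NaN on GPU.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))  # prints [nan] when the op runs on GPU
```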
Let us know if that helps!
@lukaszkaiser That is strange: I used the word-piece model and it computes the actual vocab_size (which is 27284; I also checked the actual value of subtokenizer.vocab_size). Anyway, I added a 2**6 margin as you advised:
```python
p.input_modality = {
    "inputs": (registry.Modalities.SYMBOL, subtokenizer.vocab_size + 2**6)
}
```
...along with a similar change to target_modality.
Anyway, when I create the vocab file using 32k as the target vocab_size, the result is a vocab of 140k entries (140k right after 27k). So I used the file from one iteration earlier, which has 27284 (27k) entries. I hope this is not relevant to this problem.
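As a sanity check, something like this rough scan over the generated TFRecords (the glob path is just a placeholder for my data dir) should confirm that no token id exceeds the modality's vocab size:

```python
import tensorflow as tf

VOCAB_SIZE = 27284 + 2**6  # subtokenizer.vocab_size plus the added margin

max_id = 0
# Placeholder glob; the actual shard names depend on the problem and data dir.
for path in tf.gfile.Glob("/tmp/t2t_data/koen_tokens_32k-train-*"):
    for record in tf.python_io.tf_record_iterator(path):
        ex = tf.train.Example.FromString(record)
        for key in ("inputs", "targets"):
            values = ex.features.feature[key].int64_list.value
            if values:
                max_id = max(max_id, max(values))

print("max token id:", max_id, "vocab size:", VOCAB_SIZE)
assert max_id < VOCAB_SIZE, "found a token id outside the vocab range"
```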
These subword tokenizer vocab sizes look like a mess; we'll need to make this cleaner. But does it work for you now?
@lukaszkaiser It occurred occasionally (after 1k to 15k steps). So far so good! I'll let you know after watching about 50k training steps. Thanks a lot!
Hope it works now, please reopen if you see the problem again!
@lukaszkaiser Unfortunately, it occurred again after about 31k training steps.
```
INFO:tensorflow:loss = 1.8717, step = 31601 (182.200 sec)
INFO:tensorflow:Saving checkpoints for 31622 into /home/minjoong/t2t_train/koen_tokens_32k/transformer-transformer_big/margin/model.ckpt.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "/home/minjoong/tf1_2/bin/t2t-trainer", line 82, in <module>
tf.app.run()
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/minjoong/tf1_2/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 242, in run
run_locally(exp_fn(output_dir))
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 534, in run_locally
exp.train()
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 960, in run
run_metadata=run_metadata))
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 477, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
```
I added a new translation problem (a new language pair) using my own dataset. I also built the vocab set from that dataset.
Training seems to be working, since the loss had dropped under 2.0.
Config: transformer_big, Titan X (12G) x 4.
However, I occasionally get a NaN loss problem (sometimes it occurs after ~10k steps and sometimes after ~100k steps).
Any thoughts? Is there a way to bypass this problem?
Here's the detailed output.
```
INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Doing model_fn_body took 3.163 sec.
INFO:tensorflow:Doing model_fn_body took 1.986 sec.
......
INFO:tensorflow:loss = 1.63497, step = 97601 (181.782 sec)
INFO:tensorflow:Saving checkpoints for 97607 into /home/minjoong/t2t_train/koen_tokens_32k/transformer-transformer_big/model.ckpt.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "/home/minjoong/tf1_2/bin/t2t-trainer", line 82, in <module>
tf.app.run()
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/minjoong/tf1_2/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 242, in run
run_locally(exp_fn(output_dir))
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 534, in run_locally
exp.train()
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _trainmodel
, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 960, in run
run_metadata=run_metadata))
File "/home/minjoong/tf1_2/local/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 477, in after_run
raise NanLossDuringTrainingError
```