Open gchlodzinski opened 4 years ago
I submitted the same issue before (#36), and I haven't found a solution. I think there may be a numerically unstable function in the code.
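To illustrate what a "numerically unstable function" can look like, here is a minimal pure-Python sketch comparing a naive log-softmax with the usual max-subtraction version. It is not taken from this repository, and nothing here confirms that log-softmax is the function at fault; the names `naive_log_softmax` and `stable_log_softmax` are hypothetical and used only for illustration.

```python
import math


def naive_log_softmax(logits):
    # Direct translation of log(exp(x_i) / sum_j exp(x_j)).
    # For large logits math.exp overflows. Pure Python raises OverflowError;
    # array libraries typically return inf instead, and inf / inf becomes NaN.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(logits):
    # Subtract the maximum first, so the largest exponent is exp(0) = 1
    # and the sum can never overflow.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


if __name__ == "__main__":
    big = [1000.0, 1000.0]
    try:
        naive_log_softmax(big)
    except OverflowError as err:
        print("naive version failed:", err)
    print("stable version:", stable_log_softmax(big))
```

Both versions agree on small inputs; they only differ once the logits become large enough for `exp` to overflow, which is one way a larger model could fail where a smaller one does not.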
The problem still exists.
Did you use the openwebtext dataset or a custom one? @gchlodzinski
Hi, training the SMALL model works fine, but the BASE model ends up with a NaN loss. I tried decreasing the learning rate to 1e-4, but it did not help (and it could not have, since the error happens during the warmup phase, when the learning rate is still very low). The NaN can appear at a random point within the first few steps (even after the first one). Please advise. Here is my training log:
26/1000000 = 0.0%, SPS: 0.4, ELAP: 1:05, ETA: 28 days, 21:44:53 - loss: 47.1119
27/1000000 = 0.0%, SPS: 0.4, ELAP: 1:06, ETA: 28 days, 10:18:42 - loss: 46.3502
28/1000000 = 0.0%, SPS: 0.4, ELAP: 1:08, ETA: 27 days, 23:43:51 - loss: 46.1481
29/1000000 = 0.0%, SPS: 0.4, ELAP: 1:09, ETA: 27 days, 13:46:58 - loss: 45.7326
30/1000000 = 0.0%, SPS: 0.4, ELAP: 1:10, ETA: 27 days, 4:30:34 - loss: 45.5664
31/1000000 = 0.0%, SPS: 0.4, ELAP: 1:12, ETA: 26 days, 19:48:18 - loss: 45.1209
32/1000000 = 0.0%, SPS: 0.4, ELAP: 1:13, ETA: 26 days, 11:41:27 - loss: 44.8707
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
File "run_pretraining.py", line 385, in
main()
File "run_pretraining.py", line 381, in main
args.model_name, args.data_dir, hparams))
File "run_pretraining.py", line 344, in train_or_eval
max_steps=config.num_train_steps)
File "/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/six.py", line 703, in reraise
raise value
File "/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimatorspec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/six.py", line 703, in reraise
raise value
File "/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
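The point about the warmup phase can be made concrete with a small sketch of a linear warmup schedule. This is an illustration only: the function name `warmup_learning_rate` and the `warmup_steps=10_000` value are assumptions, not values read from this run's configuration, and holding the rate constant after warmup is a simplification. Only the 1e-4 peak comes from the report above.

```python
def warmup_learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup: ramp from 0 to peak_lr over warmup_steps, then hold.

    warmup_steps is a hypothetical value chosen for illustration.
    """
    return peak_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    # Step 32 is where the log above reports the NaN.
    for step in (1, 32, 10_000):
        print(step, warmup_learning_rate(step))
```

Under these assumptions the rate at step 32 is only a tiny fraction of the peak, so lowering the peak would shrink already very small updates. That is consistent with the observation that reducing the learning rate did not help.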