2019-11-07 11:59:50.372876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-07 11:59:50.373345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-07 11:59:50.373355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-11-07 11:59:50.373362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-11-07 11:59:50.373425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7429 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into exp/model.ckpt.
INFO:tensorflow:exp/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2019-11-07 12:00:33.494592: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-07 12:00:34.100708: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fbc91c15700 = {1, 0} Found Inf or NaN global norm.
INFO:tensorflow:Error recorded from training_loop: Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]
Caused by op 'VerifyFinite/CheckNumerics', defined at:
File "main.py", line 856, in <module>
tf.app.run()
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 712, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
saving_listeners=saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
config)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2534, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "main.py", line 511, in model_fn
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
File "bert/optimization.py", line 74, in create_optimizer
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
return verify_tensor_all_finite_v2(t, msg, name)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
verify_input = array_ops.check_numerics(x, message=message)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 501, in new_func
return func(*args, **kwargs)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
Traceback (most recent call last):
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node VerifyFinite/CheckNumerics}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 856, in <module>
tf.app.run()
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 712, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
rendezvous.raise_errors()
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/six.py", line 696, in reraise
raise value
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
saving_listeners=saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
saving_listeners)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/six.py", line 696, in reraise
raise value
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]
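How can I fix this "Found Inf or NaN global norm" error? It appears to be raised on the very first training step, right after the checkpoint for step 0 is saved, at the tf.clip_by_global_norm call in bert/optimization.py line 74.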
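For context, my reading of the traceback is that tf.clip_by_global_norm computes one norm over all gradients and refuses to continue if that norm is not finite, so a single NaN or Inf in any gradient is enough to trigger it. Below is a minimal plain-Python sketch of that check, not the TensorFlow implementation: the lists stand in for gradient tensors, and `global_norm`, `clip_by_global_norm` and `non_finite_entries` are hypothetical helper names used only for illustration.

```python
import math


def global_norm(grads):
    """L2 norm over every value in every gradient (lists stand in for tensors)."""
    return math.sqrt(sum(v * v for g in grads for v in g))


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients so their joint norm is at most clip_norm.

    Mirrors the behaviour seen in the log: a NaN or Inf norm is an error.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    # Leave gradients untouched when the norm is already below clip_norm.
    scale = clip_norm / max(norm, clip_norm)
    return [[v * scale for v in g] for g in grads], norm


def non_finite_entries(named_grads):
    """Names of gradients that contain at least one NaN or Inf value."""
    return [
        name
        for name, g in named_grads.items()
        if not all(math.isfinite(v) for v in g)
    ]


# Hypothetical gradients: one healthy, one poisoned by a NaN.
example = {"layer_a/kernel": [0.1, -0.2], "layer_b/kernel": [float("nan"), 0.3]}
bad = non_finite_entries(example)
```

If this reading is right, the clipping call is only where the problem is detected, and the NaN would come from earlier in the graph (the loss or the inputs), so finding which gradient is non-finite seems like the first thing to check.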