guanlinchao / bert-dst

BERT-DST: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer

Got NaN errors #2

Closed: couragelfyang closed this issue 4 years ago

couragelfyang commented 4 years ago
2019-11-07 11:59:50.372876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-07 11:59:50.373345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-07 11:59:50.373355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-11-07 11:59:50.373362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-11-07 11:59:50.373425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7429 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into exp/model.ckpt.
INFO:tensorflow:exp/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2019-11-07 12:00:33.494592: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-07 12:00:34.100708: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fbc91c15700 = {1, 0} Found Inf or NaN global norm.
INFO:tensorflow:Error recorded from training_loop: Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]

Caused by op 'VerifyFinite/CheckNumerics', defined at:
  File "main.py", line 856, in <module>
    tf.app.run()
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 712, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
    config)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2534, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "main.py", line 511, in model_fn
    total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
  File "bert/optimization.py", line 74, in create_optimizer
    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 501, in new_func
    return func(*args, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]

INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
     [[{{node VerifyFinite/CheckNumerics}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 856, in <module>
    tf.app.run()
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 712, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
    rendezvous.raise_errors()
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]

Caused by op 'VerifyFinite/CheckNumerics', defined at:
  File "main.py", line 856, in <module>
    tf.app.run()
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 712, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
    config)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2534, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "main.py", line 511, in model_fn
    total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
  File "bert/optimization.py", line 74, in create_optimizer
    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 501, in new_func
    return func(*args, **kwargs)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/net/callisto/storage3/longfei/anaconda3/envs/bert-dst/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
     [[node VerifyFinite/CheckNumerics (defined at bert/optimization.py:74) ]]

How can I solve this?

couragelfyang commented 4 years ago

It seems something was wrong with my CUDA settings. It is fixed now.
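
For anyone who lands here with the same message: the failing check sits inside `tf.clip_by_global_norm`, called at bert/optimization.py:74 in the traceback above. That op computes a single norm over all gradients together and verifies that it is finite, so one NaN or Inf value anywhere makes the whole step fail, and the message does not say which tensor was at fault. Below is a minimal pure-Python sketch of that behavior, not TensorFlow's actual implementation. The helper names `global_norm`, `clip_by_global_norm_checked`, and `non_finite_indices` are hypothetical and exist only for this illustration.

```python
import math


def global_norm(grads):
    """Square root of the sum of squares over every value in every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def clip_by_global_norm_checked(grads, clip_norm=1.0):
    """Scale all gradients so that their joint norm is at most clip_norm.

    Raises ValueError when the joint norm is Inf or NaN, mirroring the
    message in the log above. Hypothetical helper, not TensorFlow code.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    # Gradients whose joint norm is already <= clip_norm are left unchanged.
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], norm


def non_finite_indices(grads):
    """Indices of the gradients that contain at least one Inf or NaN value."""
    return [i for i, grad in enumerate(grads)
            if any(not math.isfinite(g) for g in grad)]
```

Calling `non_finite_indices` on the gradient list before clipping narrows down where the bad values first appear. If they show up at the very first training step, as in the log above, the cause may be the environment (for example a CUDA or driver mismatch, which is what happened here) rather than the model or the data.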