Nan in summary histogram error for training images if faster_rcnn_resnet101_coco_11_06_2017 model is used.

rashikcs commented 6 years ago

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.3.0
Python version: 3.7
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: cudnn-8.0
GPU model and memory: GeForce GTX 1060 6GB
Exact command to reproduce: python train.py --logtostderr --train_dir=training/ (path to training directory) --pipeline_config_path=training/faster_rcnn_resnet101.config (path to .config file)

Describe the problem

InvalidArgumentError (see above for traceback): Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights/read)]]
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_16/bottleneck_v1/conv3/BatchNorm/gamma/read/_777 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_2768_FirstStageFeatureExtractor/resnet_v1_101/block3/unit_16/bottleneck_v1/conv3/BatchNorm/gamma/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

jart commented 6 years ago

Could you include the full error including traceback and stuff?

rashikcs commented 6 years ago

Disclaimer: Comment updated by @jart to unroll attachment with interesting context.


     [[Node: prefetch_queue_Dequeue = QueueDequeueV2[component_types=[DT_INT32, DT_STRING, DT_INT32, DT_FLOAT, DT_STRING, DT_INT32, DT_FLOAT, DT_INT64, DT_INT32, DT_BOOL, DT_INT32, DT_STRING, DT_INT32, DT_INT32, DT_FLOAT, DT_INT64, DT_INT32, DT_INT32, DT_INT32, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
     [[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    tf.app.run()
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/codemen/Documents/Rashik_CodeStack/Tensorflow/customVideo/models/research/object_detection/trainer.py", line 332, in train
    saver=saver)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
    sv.stop(threads, close_summary_writer=True)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
    yield
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 494, in run
    self.run_loop()
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
    self._sv.global_step])
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
     [[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op 'FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1', defined at:
  File "train.py", line 163, in <module>
    tf.app.run()
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/codemen/Documents/Rashik_CodeStack/Tensorflow/customVideo/models/research/object_detection/trainer.py", line 295, in train
    global_summaries.add(tf.summary.histogram(model_var.op.name, model_var))
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 192, in histogram
    tag=tag, values=values, name=scope)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 129, in _histogram_summary
    name=name)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
     [[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

jart commented 6 years ago

One thing to note is that NaNs are generally a sign that there's a problem in your model. Can you make the model not produce NaNs? If not, then what do you want to see TensorBoard do with NaN data in histograms and why do you feel it would be useful?
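To narrow down where the NaNs first appear, it can help to scan the variable values by name. The following is a minimal NumPy sketch, not part of the Object Detection API: `find_non_finite` is a hypothetical helper, and the `weights` dict stands in for values you would fetch from your own session or checkpoint.

```python
import numpy as np


def find_non_finite(named_arrays):
    """Return {name: count} for every array that holds NaN or Inf values."""
    report = {}
    for name, values in named_arrays.items():
        as_float = np.asarray(values, dtype=np.float64)
        bad = int(np.count_nonzero(~np.isfinite(as_float)))
        if bad:
            report[name] = bad
    return report


# Hypothetical stand-in for fetched variable values.
weights = {
    "conv1/weights": np.array([0.1, np.nan, -0.3]),
    "conv2/weights": np.array([0.5, 0.2]),
}
print(find_non_finite(weights))
```

Running a check like this every few steps shows which variable goes bad first, which is usually more informative than the histogram summary that happens to fail.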

rashikcs commented 6 years ago

I used the same dataset to train models with the ssd_inception_v2 and ssd_mobilenet_v1 files, and training ran properly. But whenever I use any model other than those two, it produces this error.

jart commented 6 years ago

The best place to get support is StackOverflow since this is not a bug or feature request.

taewookim commented 6 years ago

I have the same problem. I'm using a dataset that trained perfectly fine on SSD mobilenet but produces this exact same error on Faster RCNN resnet 101.

taewookim commented 6 years ago

1) This is not a TensorBoard issue.
2) This also seems to happen on the other Faster RCNN models, except for faster_rcnn_inception_resnet_v2_atrous.

auroua commented 6 years ago

I encountered the same issue. When I run the code with python3.5+tf1.4.1 I have no problem, but when I run the same code with python3.6+tf1.3 this error is printed. I think this is a bug.

nfelt commented 6 years ago

If you can establish that it's a TensorFlow bug then it's worth opening an issue, but that would be in the TensorFlow issue tracker: https://github.com/tensorflow/tensorflow/issues. The TensorBoard failure is just a symptom of the numerical instability in the underlying TensorFlow model code.
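As an illustration of how numerical instability ends up as NaN weights, here is a toy float32 gradient descent on f(w) = w^2. This is only a sketch of one common failure mode (a step size that is too large), not a diagnosis of the Faster RCNN case above; `steps_until_nan` and the learning rates are made up for the example.

```python
import numpy as np


def steps_until_nan(learning_rate, max_steps=1000):
    """Run float32 gradient descent on f(w) = w**2, starting from w = 1.

    Returns the step at which w becomes NaN, or None if it never does.
    """
    w = np.float32(1.0)
    lr = np.float32(learning_rate)
    two = np.float32(2.0)
    # Silence overflow/invalid warnings; the point is to observe the NaN.
    with np.errstate(over="ignore", invalid="ignore"):
        for step in range(1, max_steps + 1):
            grad = two * w       # df/dw = 2w
            w = w - lr * grad    # plain gradient descent update
            if np.isnan(w):
                return step
    return None


print(steps_until_nan(0.1))  # small step: w shrinks toward 0
print(steps_until_nan(1.5))  # large step: |w| doubles until it overflows
```

With the large step, the weight overflows to infinity and the next update computes inf - inf, which is NaN. Once a weight is NaN, any histogram summary over it fails the way the traceback shows.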

tensorflow / tensorboard

Nan in summary histogram error for training images if faster_rcnn_resnet101_coco_11_06_2017 model is used. #744
