Could you include the full error, including the traceback?
Disclaimer: Comment updated by @jart to unroll attachment with interesting context.
[[Node: prefetch_queue_Dequeue = QueueDequeueV2[component_types=[DT_INT32, DT_STRING, DT_INT32, DT_FLOAT, DT_STRING, DT_INT32, DT_FLOAT, DT_INT64, DT_INT32, DT_BOOL, DT_INT32, DT_STRING, DT_INT32, DT_INT32, DT_FLOAT, DT_INT64, DT_INT32, DT_INT32, DT_INT32, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](prefetch_queue)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
[[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
[[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 163, in <module>
tf.app.run()
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/codemen/Documents/Rashik_CodeStack/Tensorflow/customVideo/models/research/object_detection/trainer.py", line 332, in train
saver=saver)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
sv.stop(threads, close_summary_writer=True)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
yield
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 494, in run
self.run_loop()
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
self._sv.global_step])
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
[[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
[[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op 'FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1', defined at:
File "train.py", line 163, in <module>
tf.app.run()
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/codemen/Documents/Rashik_CodeStack/Tensorflow/customVideo/models/research/object_detection/trainer.py", line 295, in train
global_summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 192, in histogram
tag=tag, values=values, name=scope)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 129, in _histogram_summary
name=name)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/codemen/.virtualenvs/dl4cv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1
[[Node: FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/conv1/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/conv1/weights/read)]]
[[Node: Loss/BoxClassifierLoss/Tile/_1555 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_6047_Loss/BoxClassifierLoss/Tile", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
One thing to note is that NaNs are generally a sign that there's a problem in your model. Can you make the model not produce NaNs? If not, then what do you want to see TensorBoard do with NaN data in histograms and why do you feel it would be useful?
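If it helps to localize the problem: in TF 1.x you can make the graph fail at the first op that produces a NaN/Inf, instead of much later at the histogram summary. A minimal sketch, assuming TF 1.x; the toy `make_model_loss` below is hypothetical and stands in for your real training graph:

```python
import tensorflow as tf

# Hypothetical toy graph standing in for the real training graph;
# log(0) yields -inf, which then propagates NaNs downstream.
def make_model_loss():
    w = tf.get_variable('w', shape=[3], initializer=tf.ones_initializer())
    return tf.reduce_sum(tf.log(w - 1.0))

loss = make_model_loss()
# Adds a CheckNumerics op for every float tensor in the graph, so the run
# fails at the first op emitting NaN/Inf and names that op in the error.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([loss, check_op])  # raises InvalidArgumentError at the Log op
```

Running the training step together with the check op should point you at the op that first blows up, rather than at the summary writer.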
The same dataset is being used to train models with the ssd_inception_v2 and ssd_mobilenet_v1 config files. With those two it trains properly, but whenever I use anything other than those two it produces this error.
The best place to get support is StackOverflow since this is not a bug or feature request.
I have the same problem: a dataset that trained perfectly fine with SSD MobileNet produces this exact same error with Faster RCNN ResNet-101.
1) This is not a TensorBoard issue. 2) This seems to happen on the other Faster RCNN models too, except for faster_rcnn_inception_resnet_v2_atrous.
I encountered the same issue. When I run the code with Python 3.5 + TF 1.4.1 there is no problem, but when I run the same code with Python 3.6 + TF 1.3 this error is printed. I think this is a bug.
If you can establish that it's a TensorFlow bug then it's worth opening an issue, but that would be in the TensorFlow issue tracker: https://github.com/tensorflow/tensorflow/issues The TensorBoard failure is just a symptom of the numerical instability in the underlying TensorFlow model code.
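For what it's worth, the "Nan in summary histogram" error can be reproduced in isolation: the HistogramSummary op simply rejects NaN input, so any NaN in the weights surfaces there first. A minimal sketch (TF 1.x; the tag name is just illustrative):

```python
import numpy as np
import tensorflow as tf

# The HistogramSummary kernel validates its input and raises
# InvalidArgumentError as soon as it sees a NaN — the exact failure
# mode shown in the traceback above.
values = tf.constant([1.0, 2.0, np.nan], dtype=tf.float32)
summary_op = tf.summary.histogram('conv1/weights', values)

with tf.Session() as sess:
    sess.run(summary_op)  # InvalidArgumentError: Nan in summary histogram for: conv1/weights
```

So the summary op is only the messenger; the NaNs originate in the model's weights during training.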
Describe the problem
InvalidArgumentError (see above for traceback): Nan in summary histogram for: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1
[[Node: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights_1/tag, FirstStageFeatureExtractor/resnet_v1_101/block3/unit_6/bottleneck_v1/conv2/weights/read)]]
[[Node: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_16/bottleneck_v1/conv3/BatchNorm/gamma/read/_777 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_2768_FirstStageFeatureExtractor/resnet_v1_101/block3/unit_16/bottleneck_v1/conv3/BatchNorm/gamma/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]