Closed bwuzhang closed 4 years ago
@caisq Here's a tfdbg related report on TensorFlow Models.
@bwuzhang Did you check out the code and build TensorFlow with Bazel in order to use the TensorFlow debugger?
I ask because after I updated to the latest version of TF, it threw an error saying 'session.wrapper' is an unexpected keyword argument.
Please let me know.
Yes, I built it from source. r1.3 doesn't have tfdbg support for slim, so I had to check out the latest version of tensorflow/tensorflow/contrib/slim/python/slim/learning.py and build from source on the r1.3 branch.
@bwuzhang Thanks for the reply.
@bwuzhang Did you solve this problem? I ran into the same issue and don't know how to continue.
Did you solve this problem? I ran into the same issue.
You can print every value eagerly in object detection with this Python file: https://github.com/bysowhat/object_detection_debug
You can use:
from tensorflow.python import debug as tf_debug
slim.learning.train(
    ...,  # your existing arguments
    session_wrapper=tf_debug.LocalCLIDebugWrapperSession)
Hi there, we are checking to see whether you still need help with this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet that reproduces the problem, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you no longer need help with this issue, please consider closing it.
System information
Describe the problem
I want to debug a NaN error with tfdbg on the Object Detection API rfcn network. I checked out the latest version of 'tensorflow/tensorflow/contrib/slim/python/slim/learning.py' to get tfdbg support. After entering 'run' twice in tfdbg, I hit the error below. It can be reproduced with the pets example by adding 'session_wrapper=tf_debug.LocalCLIDebugWrapperSession' to the 'slim.learning.train' call in 'models/object_detection/trainer.py'.
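For anyone trying to reproduce this, here is a minimal sketch of one way to build the `session_wrapper` argument so that tfdbg can be toggled on and off. It assumes TensorFlow 1.x with a `slim.learning.train` that accepts `session_wrapper`; `make_tfdbg_session_wrapper` and `enable_tfdbg` are hypothetical names, not part of TensorFlow or the Object Detection API. It also registers tfdbg's `has_inf_or_nan` filter, since the goal here is to find a NaN. It only sets up the debugger and is not a fix for the error reported below.

```python
def make_tfdbg_session_wrapper(enable_tfdbg):
    """Return a callable usable as slim.learning.train(session_wrapper=...).

    Hypothetical helper. Returns None when debugging is disabled, which
    leaves the session unwrapped.
    """
    if not enable_tfdbg:
        return None

    # Imported lazily so that normal runs do not depend on tfdbg.
    # This is the TF 1.x import path.
    from tensorflow.python import debug as tf_debug

    def wrap(session):
        # Wrap the plain session in the tfdbg command-line interface.
        debug_session = tf_debug.LocalCLIDebugWrapperSession(session)
        # Register the built-in filter so that `run -f has_inf_or_nan`
        # stops at the first tensor containing inf or NaN.
        debug_session.add_tensor_filter(
            "has_inf_or_nan", tf_debug.has_inf_or_nan)
        return debug_session

    return wrap
```

In 'trainer.py' this could be passed as `session_wrapper=make_tfdbg_session_wrapper(True)`, or driven by a command-line flag of your own.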
Source code / logs
2017-09-03 13:04:11.473343: I tensorflow/core/debug/debug_graph_utils.cc:229] For debugging, tfdbg is changing the parallel_iterations attribute of the Enter/RefEnter node "gradients/map/while/TensorArrayReadV3/Enter_1_grad/b_acc_1" on device "/job:localhost/replica:0/task:0/gpu:0" from 16 to 1. (This does not affect subsequent non-debug runs.)
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.ValueError'>, Node name 'parallel_read/filenames/Assert/Assert/data_0' is not found in partition graphs of device /job:localhost/replica:0/task:0/cpu:0.
Traceback (most recent call last):
File "object_detection/train.py", line 202, in <module>
tf.app.run()
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 198, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/local/mnt/workspace/chris/projects/models/object_detection/trainer.py", line 310, in train
session_wrapper=tf_debug.LocalCLIDebugWrapperSession)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 777, in train
sv.stop(threads, close_summary_writer=True)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py", line 570, in wrapped_runner
callable_runner_args=runner_args)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py", line 532, in run
run_end_resp = self.on_run_end(run_end_req)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/local_cli_wrapper.py", line 319, in on_run_end
self._dump_root, partition_graphs=partition_graphs)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 690, in __init__
self._load_all_device_dumps(partition_graphs, validate)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 712, in _load_all_device_dumps
self._load_partition_graphs(partition_graphs, validate)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 1009, in _load_partition_graphs
self._validate_dump_with_graphs(device_name)
File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 1208, in _validate_dump_with_graphs
"device %s." % (datum.node_name, device_name))
ValueError: Node name 'parallel_read/filenames/Assert/Assert/data_0' is not found in partition graphs of device /job:localhost/replica:0/task:0/cpu:0.
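When triaging a log like this, it can help to pull the node and device names out of the final message. The traceback passes through queue_runner_impl.py, so the failing run was an enqueue call on a queue-runner thread, and the node name starts with 'parallel_read', which points at the input pipeline rather than the model. Below is a small standard-library sketch for extracting those fields. `parse_missing_node_error` is a hypothetical helper, and the pattern assumes the exact wording shown above, which other tfdbg versions may not use.

```python
import re

# Matches the message wording seen in the log above (an assumption;
# other tfdbg versions may phrase it differently).
_MISSING_NODE_RE = re.compile(
    r"Node name '(?P<node>[^']+)' is not found in partition graphs of "
    r"device (?P<device>\S+?)\.?$"
)


def parse_missing_node_error(message):
    """Return (node_name, device_name, top_scope), or None if no match."""
    match = _MISSING_NODE_RE.search(message.strip())
    if match is None:
        return None
    node = match.group("node")
    # The first path component is the outermost name scope of the node.
    top_scope = node.split("/")[0]
    return node, match.group("device"), top_scope
```

Calling it on the last line of the log gives the node name, the CPU device string without the trailing period, and 'parallel_read' as the outermost scope.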