tensorflow / models

Models and examples built with TensorFlow

tfdbg on slim not compatible with Object Detection API #2328

Closed bwuzhang closed 4 years ago

bwuzhang commented 7 years ago

System information

Describe the problem

I want to debug a NaN error with tfdbg on the Object Detection API's R-FCN network. I checked out the latest version of 'tensorflow/tensorflow/contrib/slim/python/slim/learning.py' to get tfdbg support. After entering 'run' twice in tfdbg, I hit the error shown below. It can be reproduced with the pets example by adding 'session_wrapper=tf_debug.LocalCLIDebugWrapperSession' to the 'slim.learning.train' call in 'models/object_detection/trainer.py'.
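For context, here is a minimal sketch of the wrapper pattern that 'session_wrapper' appears to rely on. It assumes the training loop calls the wrapper on the session it created and then uses whatever comes back. `FakeSession`, `LoggingWrapper`, and `train` are hypothetical stand-ins written for illustration, not TensorFlow classes or the actual slim implementation.

```python
class FakeSession:
    """Hypothetical stand-in for a TensorFlow session."""

    def run(self, fetch):
        return "ran %s" % fetch


class LoggingWrapper:
    """Hypothetical stand-in for a debug wrapper such as
    tf_debug.LocalCLIDebugWrapperSession: it takes a session and
    exposes the same run() interface while intercepting each call."""

    def __init__(self, sess):
        self._sess = sess
        self.calls = []

    def run(self, fetch):
        self.calls.append(fetch)  # record every run() that passes through
        return self._sess.run(fetch)


def train(train_op, number_of_steps, session_wrapper=None):
    """Toy training loop; assumed to mirror how a session_wrapper is applied."""
    sess = FakeSession()
    if session_wrapper is not None:
        # The caller passes the wrapper class (a callable), not an instance.
        sess = session_wrapper(sess)
    results = [sess.run(train_op) for _ in range(number_of_steps)]
    return sess, results


sess, results = train("train_op", 2, session_wrapper=LoggingWrapper)
print(results)
print(sess.calls)
```

Under this assumption every run() call goes through the wrapper. That would be consistent with the traceback below, where the failing call comes from a queue-runner thread and passes through the debug wrapper's run().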

Source code / logs

2017-09-03 13:04:11.473343: I tensorflow/core/debug/debug_graph_utils.cc:229] For debugging, tfdbg is changing the parallel_iterations attribute of the Enter/RefEnter node "gradients/map/while/TensorArrayReadV3/Enter_1_grad/b_acc_1" on device "/job:localhost/replica:0/task:0/gpu:0" from 16 to 1. (This does not affect subsequent non-debug runs.)
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.ValueError'>, Node name 'parallel_read/filenames/Assert/Assert/data_0' is not found in partition graphs of device /job:localhost/replica:0/task:0/cpu:0.
Traceback (most recent call last):
  File "object_detection/train.py", line 202, in <module>
    tf.app.run()
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "object_detection/train.py", line 198, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/local/mnt/workspace/chris/projects/models/object_detection/trainer.py", line 310, in train
    session_wrapper=tf_debug.LocalCLIDebugWrapperSession)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 777, in train
    sv.stop(threads, close_summary_writer=True)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py", line 570, in wrapped_runner
    callable_runner_args=runner_args)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py", line 532, in run
    run_end_resp = self.on_run_end(run_end_req)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/local_cli_wrapper.py", line 319, in on_run_end
    self._dump_root, partition_graphs=partition_graphs)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 690, in __init__
    self._load_all_device_dumps(partition_graphs, validate)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 712, in _load_all_device_dumps
    self._load_partition_graphs(partition_graphs, validate)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 1009, in _load_partition_graphs
    self._validate_dump_with_graphs(device_name)
  File "/local/mnt/workspace/chris/anaconda2/lib/python2.7/site-packages/tensorflow/python/debug/lib/debug_data.py", line 1208, in _validate_dump_with_graphs
    "device %s." % (datum.node_name, device_name))
ValueError: Node name 'parallel_read/filenames/Assert/Assert/data_0' is not found in partition graphs of device /job:localhost/replica:0/task:0/cpu:0.

jart commented 7 years ago

@caisq Here's a tfdbg related report on TensorFlow Models.

Sharathnasa commented 7 years ago

@bwuzhang Did you check out the code and build TensorFlow with Bazel in order to use the TensorFlow debugger?

I ask because I updated to the latest version of TF, and it throws an error saying 'session_wrapper' is an unexpected keyword argument.

Please let me know

bwuzhang commented 7 years ago

Yes, I built it from source. r1.3 doesn't have tfdbg support for slim, so I had to check out the latest version of tensorflow/tensorflow/contrib/slim/python/slim/learning.py and build it from source on the r1.3 branch.
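One way to tell in advance whether an installed build accepts the argument is to inspect the function signature before calling it. The sketch below assumes Python 3 (`inspect.signature` is not available in the Python 2.7 environment shown in the traceback). `train_old` and `train_new` are hypothetical stand-ins for an older and a patched `slim.learning.train`. In a real environment you would pass the actual function instead.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY)
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())


# Hypothetical stand-ins for an older and a patched slim.learning.train.
def train_old(train_op, logdir, number_of_steps=None):
    pass


def train_new(train_op, logdir, number_of_steps=None, session_wrapper=None):
    pass


print(accepts_keyword(train_old, "session_wrapper"))
print(accepts_keyword(train_new, "session_wrapper"))
```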

Sharathnasa commented 7 years ago

@bwuzhang Thanks for the reply.

xiyan524 commented 6 years ago

@bwuzhang Did you solve this problem? I ran into the same issue and do not know how to proceed.

HuangJianlang commented 6 years ago

Did you solve this problem? I ran into the same issue.

bysowhat commented 6 years ago

You can print every value eagerly in Object Detection with this Python file: https://github.com/bysowhat/object_detection_debug

varun19299 commented 6 years ago

You can use:

    from tensorflow.python import debug as tf_debug

    slim.learning.train(...
        session_wrapper=tf_debug.LocalCLIDebugWrapperSession,
        ...)

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help with this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help with this issue any more, please consider closing it.