tensorflow / models

Models and examples built with TensorFlow

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1 #3716

Closed GeorgeBohw closed 6 years ago

GeorgeBohw commented 6 years ago

Describe the problem

When I run local_test.sh, having only changed --***_crop_size to 1000, the following error occurs:

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1

 [[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read)]]
 [[Node: xception_65/middle_flow/block1/unit_13/xception_module/separable_conv3_pointwise/weights/read/_617 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2728_...ights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'image_pooling/BatchNorm/moving_variance_1', defined at:
  File "/home/george/project/deeplabv3/models-master/research/deeplab/train.py", line 347, in <module>
    tf.app.run()
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/george/project/deeplabv3/models-master/research/deeplab/train.py", line 268, in main
    summaries.add(tf.summary.histogram(model_var.op.name, model_var))
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/summary/summary.py", line 193, in histogram
    tag=tag, values=values, name=scope)
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/george/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

What is the reason? Is 1000 too large? If I want to use the model to test 1920*1080 images, how can I do that? I look forward to your response. Thank you!

walkerlala commented 6 years ago

I am facing exactly the same problem. Reducing the number of training steps (--training_number_of_steps) helps. Note that you have to remove the checkpoint files before rerunning; otherwise training resumes from the previous checkpoint and ends with the same error.
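
One plausible reason a stale checkpoint reproduces the error is that a NaN, once written into a batch-norm moving statistic, never leaves it: the moving-average update keeps propagating it. The following is a minimal pure-Python sketch of that effect, assuming the usual update rule moving = decay * moving + (1 - decay) * batch. The function name and the decay value are illustrative and are not taken from the DeepLab code.

```python
import math


def update_moving_stat(moving, batch_value, decay=0.9997):
    """One exponential-moving-average step (assumed batch-norm style update)."""
    return decay * moving + (1.0 - decay) * batch_value


moving_variance = 1.0

# A single bad batch (simulated here with NaN) poisons the statistic.
moving_variance = update_moving_stat(moving_variance, math.nan)

# Later healthy batches cannot repair it, because any arithmetic with NaN
# yields NaN again.
for healthy_batch_variance in (0.8, 1.1, 0.9):
    moving_variance = update_moving_stat(moving_variance, healthy_batch_variance)

print(math.isnan(moving_variance))
```

If the poisoned value has been saved to a checkpoint, resuming from that checkpoint restores the NaN, which would be consistent with the advice to remove the checkpoint files before rerunning.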

walkerlala commented 6 years ago

It seems that this is really due to limited GPU memory. As stated in the documentation, setting --fine_tune_batch_norm=False solves this problem. With this option set, I am now able to train for 30,000 steps ;-)

GeorgeBohw commented 6 years ago

@walkerlala I set --fine_tune_batch_norm=False and crop_size=1000, and the same error occurs. What is wrong, and what is your crop_size? I just want to run the model on large images such as 720p. Is there any other solution? Thanks!

GeorgeBohw commented 6 years ago

@walkerlala I think I can't fine-tune the pre-trained model using higher-resolution images. The only thing I can do is use the pre-trained model as-is on higher-resolution images, and I don't know how to do that. It seems that the pre-trained model only fits images with a maximum size of 513*513.

walkerlala commented 6 years ago

My crop size is 513*513

GeorgeBohw commented 6 years ago

@walkerlala Thanks for your response!

walkerlala commented 6 years ago

@GeorgeBohw Does that solve your problem?

GeorgeBohw commented 6 years ago

@walkerlala Yeah, it solved the problem, thanks~

Adnation commented 6 years ago

I am facing a similar issue with SSD Inception V2. Can you help me solve it? How can I set the crop size, reduce training_number_of_steps, or set fine_tune_batch_norm? What exact solution worked for you?

Bahramudin commented 6 years ago

@walkerlala I am also facing the same problem, but none of the solutions worked. What should I do? Thanks!

ybxbupt commented 5 years ago

Quoting GeorgeBohw: "@walkerlala I set --fine_tune_batch_norm=False and crop_size=1000, and the same error occurs. What is wrong, and what is your crop_size? I just want to run the model on large images such as 720p. Is there any other solution? Thanks!"

Delete all files in the train_logdir directory and start a new training run. It will work.
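
The clean-up step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name clear_train_logdir is hypothetical, and the demo runs against a temporary directory rather than a real train_logdir. Point it at your own directory only after confirming that nothing in it needs to be kept.

```python
import shutil
import tempfile
from pathlib import Path


def clear_train_logdir(train_logdir):
    """Delete everything inside train_logdir but keep the directory itself.

    Returns the sorted names of the removed entries (hypothetical helper).
    """
    logdir = Path(train_logdir)
    removed = []
    if not logdir.is_dir():
        return removed
    for entry in logdir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove nested directories recursively
        else:
            entry.unlink()  # remove regular files and symlinks
        removed.append(entry.name)
    return sorted(removed)


if __name__ == "__main__":
    # Safe demo: a throwaway directory standing in for train_logdir.
    demo_dir = Path(tempfile.mkdtemp())
    (demo_dir / "checkpoint").write_text("stale")
    (demo_dir / "model.ckpt-0.index").write_text("stale")
    print(clear_train_logdir(demo_dir))
```

Keeping the directory itself and removing only its contents means the training command can be rerun with the same train_logdir value and start from scratch instead of resuming.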

blutjens commented 5 years ago

After I executed the train command and ran into this error, changing training_number_of_steps and fine_tune_batch_norm did not help. I then took the rather radical step of deleting the pascal_voc_seg folder and re-downloading, unpacking and installing voc2012 and deeplabv3_pascal_train_aug. After that, running train.py with the changed training_number_of_steps and fine_tune_batch_norm got rid of the problem. However, ybxbupt's measures might be sufficient.

DahuoJ commented 5 years ago

@walkerlala Hello! My crop size is 257*257, and it works. But if I use a bigger size, 481*481 or 513*513, it does not work. Reducing the number of training steps (--training_number_of_steps) and the learning rate does not help. What should I do next to improve my result? Thanks!

Jilliansea commented 5 years ago

I have faced the same problem. Have you solved it? @Adnation

2017TJM commented 5 years ago

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
 [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
 [[Node: cond_7/one_hot/_147 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_903_cond_7/one_hot", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance', defined at:
  File "train.py", line 184, in <module>
    tf.app.run()
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\util\deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "D:\pycharm project deeplearning\models\research\object_detection\legacy\trainer.py", line 354, in train
    model_var.op.name, model_var))
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\summary\summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 309, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 3414, in create_op
    op_def=op_def)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
 [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
 [[Node: cond_7/one_hot/_147 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_903_cond_7/one_hot", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
 [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
 [[Node: cond_7/one_hot/_147 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_903_cond_7/one_hot", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\coordinator.py", line 495, in run
    self.run_loop()
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\supervisor.py", line 1035, in run_loop
    self._sv.global_step])
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
 [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
 [[Node: cond_7/one_hot/_147 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_903_cond_7/one_hot", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance', defined at:
  File "train.py", line 184, in <module>
    tf.app.run()
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\util\deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "D:\pycharm project deeplearning\models\research\object_detection\legacy\trainer.py", line 354, in train
    model_var.op.name, model_var))
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\summary\summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 309, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 3414, in create_op
    op_def=op_def)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
 [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
 [[Node: cond_7/one_hot/_147 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_903_cond_7/one_hot", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

How do I solve this problem? Can anybody help me, please?

ghost commented 5 years ago

Quoting GeorgeBohw: "@walkerlala Yeah, it solved the problem, thanks~"

I have run into this problem too. Can you share what you did to solve it? Was it just the method mentioned above? I tried that, but it failed again.

code-locker commented 2 years ago

Hi, I want to train on a custom dataset with SSD MobileNet V1 using tensorflow-gpu 1.15. I am facing the issue below.

Relying on driver to perform ptx compilation. This message will be only logged once.
2022-02-04 10:42:44.152817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Saving checkpoint to path train/model.ckpt
I0204 10:43:59.338148 140653890541312 supervisor.py:1117] Saving checkpoint to path train/model.ckpt
INFO:tensorflow:Recording summary at step 0.
I0204 10:44:34.073986 140653915719424 supervisor.py:1050] Recording summary at step 0.
INFO:tensorflow:Error reported to Coordinator: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
     [[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean (defined at /home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[FeatureExtractor/MobilenetV1/Conv2d_9_depthwise/BatchNorm/gamma/read/_1521]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
     [[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean (defined at /home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean':
  File "train.py", line 186, in <module>
    tf.app.run()
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/mlai/.local/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/mlai/.local/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "train.py", line 182, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/mlai/.local/lib/python3.7/site-packages/object_detection/legacy/trainer.py", line 353, in train
    model_var.op.name, model_var))
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/summary/summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
     [[{{node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean}}]]
     [[FeatureExtractor/MobilenetV1/Conv2d_9_depthwise/BatchNorm/gamma/read/_1521]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
     [[{{node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

How can I solve this issue? Please share your input so that I can continue.

Thanks and Regards, abhishek-ml-ai