tensorflow / models

Models and examples built with TensorFlow

[deeplab] the error of running deeplab on VOC data set #4515

Closed surfreta closed 5 years ago

surfreta commented 6 years ago

Hello, I followed the tutorial to train on the PASCAL VOC 2012 data set without modifying anything. This is the command line I used:

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=30000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="pascal_voc_seg" \
    --tf_initial_checkpoint="/data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt" \
    --train_logdir="/data/DL-Phase3/carvana/train_on_train_set/train" \
    --dataset_dir="/data/DL-Phase3/VOCdevkit/VOC2012/tfrecord"   

I got the following error. The two main issues, it seems to me, are:

WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt

and

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values                                                                                                          
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                              
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                           
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                              
WARNING:tensorflow:Variable aspp1_depthwise/depthwise_weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                         
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                          
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable aspp3_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                            
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                            
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                     
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                     
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                   
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                   
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                              
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                    
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                      
WARNING:tensorflow:From /data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.                                                                                                                                    
Instructions for updating:                                                                                                            
Please switch to tf.train.MonitoredTrainingSession                                                                                    
2018-06-12 18:32:03.287833: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX                                                                                      
INFO:tensorflow:Restoring parameters from /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                      
INFO:tensorflow:Starting Session.                                                                                                     
INFO:tensorflow:Saving checkpoint to path /data/DL-Phase3/carvana/train_on_train_set/train/model.ckpt               
INFO:tensorflow:Starting Queues.                                                                                                      
INFO:tensorflow:global_step/sec: 0                                                                                                    
INFO:tensorflow:Recording summary at step 0.                                                                                          
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values                                                                                                          
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

Caused by op 'CheckNumerics', defined at:
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()                                
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))                                                                                                             
  File "deeplab/train.py", line 335, in main                                                                                          
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')                                                                 
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics                                                                                                                      
    "CheckNumerics", tensor=tensor, message=message, name=name)                                                                       
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper                                                                                                             
    op_def=op_def)                                                                                                                    
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op                                                                                                                              
    op_def=op_def)                                                                                                                    
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__                                                                                                                               
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access                                                

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

Traceback (most recent call last):
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call                                                                                                                              
    return fn(*args)                                                                                                                  
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn                                                                                                                               
    status, run_metadata)                                                                                                             
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__                                                                                                                        
    c_api.TF_GetCode(self.status.status))                                                                                             
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values                             
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()                                
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))                                                                                                             
  File "deeplab/train.py", line 385, in main                                                                                          
    save_interval_secs=FLAGS.save_interval_secs)                                                                                      
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train                                                                                                                      
    sess, train_op, global_step, train_step_kwargs)                                                                                   
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step                                                                                                                 
    run_metadata=run_metadata)                                                                                                        
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run                                                                                                                                    
    run_metadata_ptr)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]

Caused by op 'CheckNumerics', defined at:
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "deeplab/train.py", line 335, in main
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

qlzh727 commented 6 years ago

It seems that either the checkpoint is not loaded correctly, or the content of the checkpoint does not match the model itself. Assigning to someone from deeplab to take a further look.
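
One way to narrow down which of these two cases applies is to see which scopes the missing variables belong to. The following is a minimal sketch, not part of the DeepLab code, that assumes the `Variable ... missing in checkpoint ...` warning format shown in the log above. The function name `group_missing_variables` and the treatment of names ending in `/Momentum` as optimizer slots are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches the warning format seen in the training log in this thread.
_MISSING = re.compile(
    r"^WARNING:tensorflow:Variable (\S+) missing in checkpoint (\S+)"
)


def group_missing_variables(log_lines):
    """Group missing variable names by their top-level scope.

    Names ending in '/Momentum' are assumed to be optimizer slots and are
    reported separately from model variables.
    """
    model_vars = defaultdict(list)
    optimizer_slots = defaultdict(list)
    for line in log_lines:
        match = _MISSING.match(line.strip())
        if not match:
            continue  # Skip INFO lines, tracebacks, and anything else.
        name = match.group(1)
        scope = name.split("/", 1)[0]
        target = optimizer_slots if name.endswith("/Momentum") else model_vars
        target[scope].append(name)
    return dict(model_vars), dict(optimizer_slots)


# A few lines copied from the log above.
ckpt = "/data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt"
sample_log = [
    "WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint " + ckpt,
    "WARNING:tensorflow:Variable aspp0/weights missing in checkpoint " + ckpt,
    "WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint " + ckpt,
    "INFO:tensorflow:Starting Session.",
]

model_vars, optimizer_slots = group_missing_variables(sample_log)
for scope, names in sorted(model_vars.items()):
    print(scope, len(names))
```

If every missing model variable sits under scopes such as `decoder`, `aspp*`, `image_pooling`, or `concat_projection`, and none under the backbone scope, that would suggest the checkpoint simply does not contain those layers rather than that loading failed outright.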

BlueWinters commented 6 years ago

Hi, I ran into the same problem, and I think @qlzh727 is right. There is one difference in my case: Key aspp0/BatchNorm/beta not found in checkpoint.
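
A "Key ... not found in checkpoint" message can be checked by comparing the variable names the model expects against the names stored in the checkpoint. Below is a minimal sketch of that comparison in plain Python. The helper `diff_variable_names` and the example names are hypothetical. In a real run the checkpoint names could come from something like `tf.train.list_variables(checkpoint_path)`, assuming that function is available in the installed TensorFlow version.

```python
def diff_variable_names(model_names, checkpoint_names):
    """Return names present on only one side of the model/checkpoint pair."""
    model_set = set(model_names)
    checkpoint_set = set(checkpoint_names)
    return {
        # Variables the model needs but the checkpoint cannot provide.
        "missing_in_checkpoint": sorted(model_set - checkpoint_set),
        # Variables stored in the checkpoint that the model never reads.
        "unused_in_checkpoint": sorted(checkpoint_set - model_set),
    }


# Illustrative names only; substitute the real lists from your own setup.
model_names = [
    "xception_65/entry_flow/conv1_1/weights",
    "aspp0/BatchNorm/beta",
]
checkpoint_names = [
    "xception_65/entry_flow/conv1_1/weights",
    "global_step",
]

report = diff_variable_names(model_names, checkpoint_names)
for side, names in report.items():
    print(side, names)
```

An empty `missing_in_checkpoint` list means every model variable can be restored. A non-empty one lists the keys that a strict restore would fail on.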

Harshini-Gadige commented 5 years ago

Closing as this is resolved. If you have further comments, please add them and we will reopen. Thanks!