rishizek / tensorflow-deeplab-v3-plus

DeepLabv3+ built in TensorFlow
MIT License

can't use deeplabv3_ver1.tar.gz for training or evaluate or inference #16

Open electronicYH opened 6 years ago

electronicYH commented 6 years ago

Training, evaluation, and inference all fail with the same error:

NotFoundError (see above for traceback): Key decoder/low_level_features/conv_1x1/BatchNorm/beta not found in checkpoint

 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
 [[Node: save/RestoreV2/_301 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

rishizek commented 6 years ago

Hi @electronicYH , thank you for your interest in the repo.

Could you also share the exact command you ran for, say, inference? I suspect the arguments may have been misunderstood.

AshAswin commented 6 years ago

Hi Rishizek, Thanks a ton for your repo!

I am facing the same issue. I am using the default arguments provided with the code, and I have made sure that the image folders, the model, and the input file are in the appropriate locations.

`parser.add_argument('--image_data_dir', type=str, default='./dataset/VOCdevkit/VOC2012/JPEGImages', help='The directory containing the image data.')

parser.add_argument('--label_data_dir', type=str, default='./dataset/VOCdevkit/VOC2012/SegmentationClassAug', help='The directory containing the ground truth label data.')

parser.add_argument('--evaluation_data_list', type=str, default='./dataset/val.txt', help='Path to the file listing the evaluation images.')

parser.add_argument('--model_dir', type=str, default='./model', help="Base directory for the model. " "Make sure 'model_checkpoint_path' given in 'checkpoint' file matches " "with checkpoint name.")

parser.add_argument('--base_architecture', type=str, default='resnet_v2_101', choices=['resnet_v2_50', 'resnet_v2_101'], help='The architecture of base Resnet building block.')

parser.add_argument('--output_stride', type=int, default=16, choices=[8, 16], help='Output stride for DeepLab v3. Currently 8 or 16 is supported.')

_NUM_CLASSES = 21`

For your information, the model directory contains the model.ckpt-30358 checkpoint that you provided.

rishizek commented 6 years ago

Hi @AshAswin , thank you for your interest in the repo.

Could you try running inference.py instead of evaluate.py? There is a chance that the label PNG files are in an unexpected format (as explained in the README, the label PNG files need to be downloaded from DrSleep's repo). If that is the cause, you should not hit the problem with inference.py, which does not read the labels.

wangbin8611 commented 6 years ago

Thank you for the reply. My problem is this: I downloaded deeplabv3plus.ver1.tar.gz and the augmented segmentation data from www.dropbox.com, and the PASCAL VOC dataset from host.robots.ox.ac.uk. With the pre-trained ResNet v2 101 model, VOC2012, and the augmented segmentation data, I can train the model and reach an mIoU of 0.7553. However, when I extract deeplabv3plus.ver1.tar.gz into the model directory and use that model for evaluation or inference, both fail with the same error:

NotFoundError (see above for traceback): Key decoder/low_level_features/conv_1x1/BatchNorm/beta not found in checkpoint [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]] [[Node: save/RestoreV2/_301 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

The commands I ran were: python evaluation.py and python inference.py

wangbin8611 commented 6 years ago

Another question: I want to train the model to segment lanes in images. I prepared a variety of road pictures and their labels, in which the lane is marked in white (pixel value [255, 255, 255]) and everything else is black (pixel value [0, 0, 0]). The label images are color images with 8-bit depth. I built the TFRecord files from the lane pictures and labels in the same way as for VOC2012. However, during training, the label images shown in the images tab of TensorBoard are completely black, and training fails. Please help me.

rishizek commented 6 years ago

Hi @wangbin8611 , thank you for your interest in the repo and useful information.

It looks like the error in this thread occurs when the model architecture defined in the code does not match that of the checkpoint. When I ran inference.py of deeplabv3plus with a checkpoint for deeplabv3, which has a slightly different architecture, I reproduced a similar error. I have also updated the architecture and checkpoint of deeplabv3+ in the past. Therefore, I believe that either the inference.py code or the checkpoint is out of date. You can also find a similar issue here. Could you update the code and checkpoint to the latest versions and check whether you still get the same error?
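
One way to check for such a mismatch is to compare the variable names the graph expects with the names stored in the checkpoint. The following is only a sketch, not code from this repo: the helper `diff_variable_names` and the sample name lists are hypothetical (only the `BatchNorm/beta` key comes from the error above). In a TensorFlow 1.x session the two lists would presumably come from `tf.train.list_variables(checkpoint_path)` and from the names of `tf.global_variables()`.

```python
def diff_variable_names(graph_names, checkpoint_names):
    """Return (missing, unused) as sorted lists of variable names.

    missing: names the graph expects but the checkpoint lacks
             (these trigger "Key ... not found in checkpoint").
    unused:  names stored in the checkpoint that the graph never asks for.
    """
    graph_set = set(graph_names)
    checkpoint_set = set(checkpoint_names)
    missing = sorted(graph_set - checkpoint_set)
    unused = sorted(checkpoint_set - graph_set)
    return missing, unused


# Hypothetical sample data. With TensorFlow 1.x installed, the lists could
# be built roughly like this (an assumption, adapt to your setup):
#   checkpoint_names = [name for name, _ in tf.train.list_variables(path)]
#   graph_names = [v.op.name for v in tf.global_variables()]
graph_names = [
    "resnet_v2_101/conv1/weights",
    "decoder/low_level_features/conv_1x1/weights",
    "decoder/low_level_features/conv_1x1/BatchNorm/beta",
]
checkpoint_names = [
    "resnet_v2_101/conv1/weights",
    "decoder/low_level_features/conv_1x1/weights",
    "decoder/low_level_features/conv_1x1/biases",
]

missing, unused = diff_variable_names(graph_names, checkpoint_names)
print("Expected by the graph but not in the checkpoint:", missing)
print("In the checkpoint but not used by the graph:", unused)
```

If the first list is non-empty, the code and the checkpoint most likely describe different architectures.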

Regarding the other question: as you may know, the label should have dimensions [width x height x 1], not [width x height x 3] like a usual RGB color image. Please refer to here for more detail. So you should set the id for lane to 1 and the id for everything else (background) to 0. To show them in the correct colors in TensorBoard, you need to modify the color mapping given here.
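
The conversion from a white-on-black RGB mask to a single-channel index label could be sketched as follows. This is an illustration in plain Python, not code from this repo: `COLOR_TO_ID` and `rgb_mask_to_index_label` are hypothetical names, and the mask is a nested list standing in for a decoded image. In practice one would more likely do this with NumPy and an image library before writing the TFRecord files.

```python
# Hypothetical mapping from RGB color to class id for a two-class lane dataset.
COLOR_TO_ID = {
    (0, 0, 0): 0,        # background
    (255, 255, 255): 1,  # lane
}


def rgb_mask_to_index_label(rgb_mask, color_to_id=COLOR_TO_ID):
    """Convert a grid of (R, G, B) pixels into a grid of one-element class ids.

    Each output pixel is a list with a single class id, so the color
    channel axis of size 3 becomes a label axis of size 1.
    """
    label = []
    for row in rgb_mask:
        label_row = []
        for pixel in row:
            key = tuple(pixel)
            if key not in color_to_id:
                # Anti-aliased or compressed masks often contain stray colors.
                raise ValueError("Unexpected color in mask: %r" % (key,))
            label_row.append([color_to_id[key]])
        label.append(label_row)
    return label


white, black = (255, 255, 255), (0, 0, 0)
rgb_mask = [
    [black, white, black],
    [black, white, white],
]
label = rgb_mask_to_index_label(rgb_mask)
print(label)
```

Raising on unknown colors is a deliberate choice in this sketch: silently mapping them to background would hide labeling mistakes.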

I hope this answers your questions.

prajakta-13 commented 6 years ago

Hi Rishizek,

As suggested in the comment above, I used inference.py, which works fine. However, inference.py does not compute any performance metrics. I believe that is handled in evaluate.py.

  1. Is there any means to get the metrics for performance along with inference.py?

  2. Also, could you please share the specific details for expected format of label png files for custom data?

  3. Here's the error I get when running evaluate.py: InvalidArgumentError (see above for traceback): Tried to explicitly squeeze dimension 3 but dimension was not 1: 3. The error occurs at the squeeze operation in deeplab_model.py. However, I printed the dimensions of the labels tensor before and after squeezing, and they seem to be correct.

Here are the complete logs for reference:

python3 evaluate.py
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
labels Tensor("IteratorGetNext:1", shape=(?, ?, ?, 1), dtype=int32)
features Tensor("IteratorGetNext:0", shape=(?, ?, ?, 3), dtype=float32)
Here are labels Tensor("IteratorGetNext:1", shape=(?, ?, ?, 1), dtype=int32)
Here are features Tensor("IteratorGetNext:0", shape=(?, ?, ?, 3), dtype=float32)
Here are labels after squeezing Tensor("Squeeze:0", shape=(?, ?, ?), dtype=int32)
2018-06-06 21:04:50.356765: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-06 21:04:50.461692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-06-06 21:04:50.461984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.759 pciBusID: 0000:01:00.0 totalMemory: 7.92GiB freeMemory: 4.52GiB
2018-06-06 21:04:50.462000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-06-06 21:04:50.664131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4270 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./Resnet30kModel-1/OutputModel/model.ckpt-17785
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to explicitly squeeze dimension 3 but dimension was not 1: 3
 [[Node: Squeeze = SqueezeT=DT_INT32, squeeze_dims=[3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
 [[Node: confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_1263 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2146_confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "evaluate.py", line 151, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "evaluate.py", line 88, in main
    preds = sess.run(predictions)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to explicitly squeeze dimension 3 but dimension was not 1: 3
 [[Node: Squeeze = SqueezeT=DT_INT32, squeeze_dims=[3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
 [[Node: confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_1263 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2146_confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Squeeze', defined at:
  File "evaluate.py", line 151, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "evaluate.py", line 73, in main
    'freeze_batch_norm': True
  File "/home/kpit/Aswin/SemanticAnnotation/tensorflow-deeplab-v3-plus/deeplab_model.py", line 207, in deeplabv3_plus_model_fn
    labels = tf.squeeze(labels, axis=3)  # reduce the channel dimension.
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 2568, in squeeze
    return gen_array_ops._squeeze(input, axis, name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5169, in _squeeze
    "Squeeze", input=input, squeeze_dims=axis, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Tried to explicitly squeeze dimension 3 but dimension was not 1: 3 [[Node: Squeeze = SqueezeT=DT_INT32, squeeze_dims=[3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]] [[Node: confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_1263 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2146_confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

rishizek commented 6 years ago

Hi @prajakta-13 , Thank you for your interest in the repo.

Let me answer your questions:

Is there any means to get the metrics for performance along with inference.py?

Yes, it is possible to implement performance metrics in inference.py. But to compute them, you need a corresponding ground truth label for each input image. inference.py is meant to work on any image, even one without a ground truth label. That is why I did not implement performance metrics in inference.py.
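
To illustrate why ground truth is required, here is a minimal sketch of mean IoU computed from a confusion matrix. It is not the code used by evaluate.py: `confusion_matrix`, `mean_iou`, and the sample data are hypothetical, labels and predictions are flattened lists of class ids, and Python 3 division is assumed.

```python
def confusion_matrix(labels, predictions, num_classes):
    """Count pixels per (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth, predicted in zip(labels, predictions):
        matrix[truth][predicted] += 1
    return matrix


def mean_iou(matrix):
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for c in range(len(matrix)):
        true_positive = matrix[c][c]
        false_negative = sum(matrix[c]) - true_positive
        false_positive = sum(row[c] for row in matrix) - true_positive
        denominator = true_positive + false_positive + false_negative
        if denominator > 0:  # skip classes absent from both labels and predictions
            ious.append(true_positive / denominator)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened ground truth and prediction for six pixels.
labels = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(labels, predictions, num_classes=2)
miou = mean_iou(matrix)
print(matrix, miou)
```

Without the `labels` list there is nothing to fill the confusion matrix with, which is why a script that only sees input images cannot report such metrics.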

Also, could you please share the specific details for expected format of label png files for custom data?

As I mentioned above, the label should have dimensions [width x height x 1], not [width x height x 3] like a usual RGB color image. Each class should be represented by an index, such as 0 for background, 1 for person, 2 for dog, etc. The labels in the augmented segmentation data provided by DrSleep are concrete examples of this.

Here's the error I get when running evaluate.py: InvalidArgumentError (see above for traceback): Tried to explicitly squeeze dimension 3 but dimension was not 1: 3. The error occurs at the squeeze operation in deeplab_model.py. However, I printed the dimensions of the labels tensor before and after squeezing, and they seem to be correct.

Well, what you printed out is the dimensions of the label tensor declared in the graph, not those of your actual label dataset (this is one of the difficult parts of TensorFlow). I suspect your label dataset has dimensions [width x height x 3], and because of that (namely, axis 3 cannot be squeezed because the data is not in the form [width x height x 1]), running evaluate.py fails.
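
One way to inspect the label files themselves, rather than the tensor, is to read the PNG header. The sketch below uses only the Python standard library and the IHDR layout from the PNG specification. `png_header_info` and `fake_png_header` are hypothetical helpers, and the header bytes in the demo are synthetic, not taken from any real dataset.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Samples per pixel implied by the PNG color type field.
CHANNELS_BY_COLOR_TYPE = {
    0: 1,  # grayscale
    2: 3,  # truecolor (RGB)
    3: 1,  # indexed color (palette)
    4: 2,  # grayscale with alpha
    6: 4,  # truecolor with alpha (RGBA)
}


def png_header_info(data):
    """Return (width, height, bit_depth, channels) from the first 26 bytes of a PNG."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("Not a PNG file or header is truncated")
    # IHDR data starts at byte 16: width, height (4 bytes each, big-endian),
    # then bit depth and color type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, bit_depth, CHANNELS_BY_COLOR_TYPE[color_type]


def fake_png_header(width, height, bit_depth, color_type):
    """Build synthetic signature + IHDR bytes for the demo (no CRC, no pixel data)."""
    ihdr = struct.pack(">IIBBBBB", width, height, bit_depth, color_type, 0, 0, 0)
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr


rgb_info = png_header_info(fake_png_header(500, 375, 8, 2))
gray_info = png_header_info(fake_png_header(500, 375, 8, 0))
print("RGB label:", rgb_info)
print("Grayscale label:", gray_info)
```

For a real file, something like `png_header_info(open(path, "rb").read(26))` should work. A channel count of 3 would be consistent with the squeeze error above. Note that palette images store one index per pixel, but some decoders expand them to RGB, so the decoded shape may still differ from what the header suggests.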

I hope this helps you solve the problem.

prajakta-13 commented 6 years ago

Hi Rishizek,

Thank you so much for the answers. They did help solve my problem. I am now able to run the script evaluate.py.

DRACOyu commented 5 years ago

@rishizek Hello, I have run into the same problem, as follows: Key Variable not found in checkpoint [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]] I removed the tf.train.init_checkpoint call, but the problem remains. I use inceptionV3 as the baseline. From your answer, the checkpoint needs to be updated as well, so how do I update the checkpoint? I hope to hear from you, thanks.