[deeplab] Evaluation with pre-trained model does not match provided value

rogercw commented 6 years ago

System information

What is the top-level directory of the model you are using: deeplab
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary (pip install --upgrade)
TensorFlow version (use command below): 1.6.0
Bazel version (if compiling from source):
CUDA/cuDNN version: V9.0.176 / 7.0.5
GPU model and memory: GeForce GTX 1080 Ti / 10.91GiB
Exact command to reproduce:

Describe the problem

I downloaded the pre-trained model 'xception_coco_voc_trainaug' from the model zoo and used it as "checkpoint_dir" for the evaluation. Since the tar file does not include a 'checkpoint' file, I manually created one with both "model_checkpoint_path" and "all_model_checkpoint_paths" set to the downloaded file "model.ckpt" (the evaluation will not run without a 'checkpoint' file).
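For readers who want to reproduce this step, here is a minimal sketch of writing such a 'checkpoint' index file. The helper name write_checkpoint_index and the temporary directory are illustrative assumptions, not part of deeplab; the two keys are the ones named in the description above.

```python
import os
import tempfile


def write_checkpoint_index(checkpoint_dir, checkpoint_name="model.ckpt"):
    """Write a minimal 'checkpoint' index file pointing at one checkpoint prefix.

    Hypothetical helper for illustration; it only writes the two text fields
    mentioned in the issue description.
    """
    index_path = os.path.join(checkpoint_dir, "checkpoint")
    lines = [
        'model_checkpoint_path: "%s"' % checkpoint_name,
        'all_model_checkpoint_paths: "%s"' % checkpoint_name,
    ]
    with open(index_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return index_path


# Demo in a throwaway directory; in practice this would be the checkpoint_dir.
demo_dir = tempfile.mkdtemp()
index_file = write_checkpoint_index(demo_dir)
with open(index_file) as f:
    index_text = f.read()
print(index_text)
```

The checkpoint name is relative to the directory holding the index file, so "model.ckpt" is assumed to sit next to it.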

However, after I ran 'eval.py' with the command from 'local_test.sh', the "miou_1.0" I got was 0.613665 (about 61.4%), which is far below the expected 82.20%. What might I be doing wrong here? Thanks.
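For context on the metric, here is a rough sketch of how mean intersection-over-union is commonly computed from a confusion matrix. It is an illustration only, not the code in eval.py; the function name mean_iou and the toy matrix are made up for this example.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: ground truth, columns: prediction)."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(true_pos / float(denom))
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class matrix with invented pixel counts.
toy_confusion = [
    [8, 2],
    [1, 9],
]
toy_miou = mean_iou(toy_confusion)
print("mIOU = %.4f (%.2f%%)" % (toy_miou, 100 * toy_miou))
```

The value is a fraction between 0 and 1, so it has to be multiplied by 100 before comparing it with a percentage such as 82.20%.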

P.S. I originally planned to post this question on Stack Overflow. However, there is no 'deeplab' tag available yet, and I do not have enough reputation to create one.

Source code / logs

python "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --eval_logdir="${EVAL_LOGDIR}" \
  --dataset_dir="${PASCAL_DATASET}" \
  --max_number_of_evaluations=1

/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 1449
INFO:tensorflow:Eval batch size 1 and num batch 1449
INFO:tensorflow:Waiting for new checkpoint at /pascal_voc/exp/train_on_trainval_set/train0
INFO:tensorflow:Found new checkpoint at /pascal_voc/exp/train_on_trainval_set/train0/model.ckpt
WARNING:tensorflow:From /lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-03-20 11:29:44.459901: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-20 11:29:48.065719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0e:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-03-20 11:29:48.066463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-20 11:29:48.445285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10409 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /pascal_voc/exp/train_on_trainval_set/train0/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-03-20-18:29:53
INFO:tensorflow:Evaluation [144/1449]
INFO:tensorflow:Evaluation [288/1449]
INFO:tensorflow:Evaluation [432/1449]
INFO:tensorflow:Evaluation [576/1449]
INFO:tensorflow:Evaluation [720/1449]
INFO:tensorflow:Evaluation [864/1449]
INFO:tensorflow:Evaluation [1008/1449]
INFO:tensorflow:Evaluation [1152/1449]
INFO:tensorflow:Evaluation [1296/1449]
INFO:tensorflow:Evaluation [1440/1449]
INFO:tensorflow:Evaluation [1449/1449]
INFO:tensorflow:Finished evaluation at 2018-03-20-18:31:19
miou_1.0[0.613665]

aquariusjay commented 6 years ago

Could you please try running the simple test, sh local_test.sh (without modifying anything, including renaming the checkpoints)? The script simply trains the same checkpoint you are using for 10 iterations (you can change this if you like), evaluates the model (mIOU should be around 82.20% when the number of iterations is small), and visualizes some results. Once you can reproduce the results, we can then try renaming the checkpoints and so on.

rogercw commented 6 years ago

Thanks for the quick response, @aquariusjay! I can get the expected outcome by running local_test.sh. It turns out that my PASCAL dataset was somehow corrupted, which is why I got a lower value earlier. After switching to the newly downloaded dataset, I get the same value by running 'eval.py' directly as well. Thanks for the help.

As a side note, I first needed to make 'download_and_convert_voc2012.sh' executable and replace sh download_and_convert_voc2012.sh with ./download_and_convert_voc2012.sh in 'local_test.sh'; otherwise, I ran into: download_and_convert_voc2012.sh: 43: download_and_convert_voc2012.sh: Syntax error: "(" unexpected. Maybe it is only an issue with my environment, though.

kr-ish commented 6 years ago

@rogercooper76, re: the side note about 'download_and_convert_voc2012.sh', see #3669

aquariusjay commented 6 years ago

Good job on figuring out the problem and glad to know that the issue is resolved. Closing this issue.

wldeephi commented 6 years ago

Hello @rogercooper76, when I run eval.py on the Cityscapes dataset, it runs on the CPU, but when I run train.py, it runs on the GPUs. Do you have any suggestions for solving this problem?

rogercw commented 6 years ago

Hi @wldeephi, multiple GPUs are needed to run both scripts at the same time. I set "CUDA_VISIBLE_DEVICES=$GPUID" to specify which GPU each script uses. For example, I might run training with "CUDA_VISIBLE_DEVICES=0 python train.py ..." and evaluation with "CUDA_VISIBLE_DEVICES=1 python eval.py ...". I am not sure whether this helps in your case.
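The same idea can be sketched from a Python launcher. This is only an illustration under the assumption that GPUs 0 and 1 exist on the machine; the helper name env_for_gpu is hypothetical, and the script flags are left out just as they are in the commands above.

```python
import os


def env_for_gpu(gpu_id, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES pinned to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


train_env = env_for_gpu(0)  # training only sees GPU 0
eval_env = env_for_gpu(1)   # evaluation only sees GPU 1

# Hypothetical launch, with the remaining script flags omitted:
# import subprocess
# train_proc = subprocess.Popen(["python", "train.py"], env=train_env)
# eval_proc = subprocess.Popen(["python", "eval.py"], env=eval_env)
```

Each child process gets its own copy of the environment, so the two settings do not interfere with each other or with the parent shell.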

liangjianfans commented 6 years ago

Hi @rogercooper76, after running eval.py, I got a result like miou_1.0[1]. What does that mean? Thanks.

zibuyu2018 commented 6 years ago

I got a result like miou_1.0[0.935625196]. I used Python 3.

feixuedudiao commented 6 years ago

@aquariusjay When I train on MS COCO 2014 with the pre-trained MobileNetV2 model, I tried crop sizes [641, 641] and [657, 657], and running eval.py reported the error "tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [0 0 0...] [y (mean_iou/ToInt64_2:0) = ] [81]". I searched Google for a fix but could not find one, and many other people seem to be confused by this error. Can anyone tell me how to solve it?

aquariusjay commented 6 years ago

When evaluating on COCO images, setting eval_crop_size = [641, 641] should resolve the problem. It seems that your model predicts something larger than expected. For COCO, you need to set num_classes=91 in segmentation_dataset.py.
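As a rough illustration of why the assertion fires: every class ID fed to the confusion matrix has to be smaller than num_classes. The sketch below uses a hypothetical helper, out_of_range_ids, with invented sample IDs, and it assumes an ignore label of 255, which may differ in your dataset configuration.

```python
def out_of_range_ids(class_ids, num_classes, ignore_label=255):
    """Return the sorted class IDs that a num_classes-wide confusion matrix cannot hold.

    ignore_label=255 is an assumption for this sketch, not a value taken from the thread.
    """
    return sorted(
        {i for i in class_ids if i != ignore_label and not (0 <= i < num_classes)}
    )


# Invented sample of class IDs for illustration only.
sample_ids = [0, 1, 17, 80, 84, 90, 255]
print(out_of_range_ids(sample_ids, num_classes=81))
print(out_of_range_ids(sample_ids, num_classes=91))
```

Running a check like this over your own label maps or predictions can show whether any ID reaches or exceeds the bound reported in the error message (81 in the error above).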
