_process_batch() in deeplab vis.py fails silently, without any error message

System information

What is the top-level directory of the model you are using: deeplab
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (VMware) and Google Colab
TensorFlow installed from (source or binary): using pip
TensorFlow version (use command below): tf.GIT_VERSION -> v1.14.0-0-g87989f6959 / tf.VERSION -> 1.14.0
Bazel version (if compiling from source): -
CUDA/cuDNN version: 10.0, V10.0.130
GPU model and memory: Tesla K80, 11441MB
Exact command to reproduce (the leading ! is needed only on Colab):

!python deeplab/vis.py \
    --logtostderr \
    --vis_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --vis_crop_size="1025,2049" \
    --dataset="cityscapes" \
    --colormap_type="cityscapes" \
    --checkpoint_dir="~~~~/models/research/deeplab/datasets/cityscapes/exp/train" \
    --vis_logdir="~~~~/models/research/deeplab/datasets/cityscapes/exp/vis" \
    --dataset_dir="~~~~/models/research/deeplab/datasets/cityscapes/tfrecord" \
    --max_number_of_iterations=1  # optional

Description of the problem

Hi, smart and brilliant and amazing people. I need your help.

I'm trying to use deeplab, so I followed the installation guide and the running guide for Cityscapes. Everything works through dataset conversion (sh convert_cityscapes.sh) and the training job (python deeplab/train.py) with the appropriate pretrained model. However, the evaluation job and the visualization job do not work: each one fails right at its actual start, without throwing any error message.

For the evaluation job, the last line printed is:

I0703 11:15:41.912175 140342253922176 evaluation.py:450] Starting evaluation at 2019-07-03-11:15:41

If I add one more flag, --max_number_of_evaluations=1, the job simply exits after printing that line.

For the visualization job, the last lines printed are:

I0703 11:18:00.007635 139958688167808 session_manager.py:500] Running local_init_op.
I0703 11:18:00.086654 139958688167808 session_manager.py:502] Done running local_init_op.
I0703 11:18:00.861741 139958688167808 vis.py:296] Visualizing batch 1
I0703 11:18:01.962661 139958688167808 vis.py:312] Finished visualization at 2019-07-03-11:18:01

If I add one more flag, --max_number_of_iterations=1, the job simply exits after printing these lines.

To figure out what the problem is, I added print statements around the _process_batch call in vis.py, like this:

print('before')
_process_batch(sess=sess,
               original_images=samples[common.ORIGINAL_IMAGE],
               semantic_predictions=predictions,
               image_names=samples[common.IMAGE_NAME],
               image_heights=samples[common.HEIGHT],
               image_widths=samples[common.WIDTH],
               image_id_offset=image_id_offset,
               save_dir=save_dir,
               raw_save_dir=raw_save_dir,
               train_id_to_eval_id=train_id_to_eval_id)
print('after')

As a result, 'before' is printed but 'after' never is, which means that the _process_batch call never completes. The job creates the raw_segmentation_results and segmentation_results folders, but they stay empty.

I really cannot understand why it does not work. Below is some information that may help to recognize and diagnose the problem.

Additional information

At first, I tried VMware Workstation 15 with Ubuntu 14.04 because installation.md mentions it. After hitting this problem, I tried Ubuntu 18.04 because it is the latest LTS version, but the same problem appeared. I guessed the cause might be GPU memory or some other hardware limit, so I expanded the VM's GPU memory to 3GB and the number of CPU cores to 4, among other changes. The problem persisted.

Since VMware itself could be the cause, I tried Google Colab. The behavior is the same! Everything works through the training job, but the evaluation and visualization jobs do not. Specifically, the _process_batch function never completes.

I used the gtFine data from the Cityscapes dataset (you have to sign up to download it) and the pretrained model for xception_65. The directory structure is as follows:

+ datasets
  + cityscapes
    + gtFine (downloaded data)
    + tfrecord (transformed data using convert_cityscapes.sh)
    + exp
      + train_on_train_set
        + train (output of the training job)
        + eval (where the evaluation job's output should go)
        + vis (where the visualization job's output should go)
        + initial (pretrained model files)
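
Since the job stops right where it would read its first batch, it may be worth confirming that the tfrecord folder actually contains files for the requested split. The snippet below is a rough sketch of such a check. It assumes the converted shards are named with the split as a prefix (for example val-...), which should be verified against the real output of convert_cityscapes.sh. find_split_files is a hypothetical helper, not part of deeplab.

```python
import glob
import os


def find_split_files(dataset_dir, split):
    """Return the sorted files in dataset_dir whose names start with '<split>-'.

    The '<split>-*' naming is an assumption about how the shards are named.
    """
    expanded = os.path.expanduser(dataset_dir)
    return sorted(glob.glob(os.path.join(expanded, split + "-*")))


def report(dataset_dir, split):
    """Print how many files and bytes were found for the given split."""
    files = find_split_files(dataset_dir, split)
    total_bytes = sum(os.path.getsize(f) for f in files)
    print("%d file(s), %d bytes for split %r in %s"
          % (len(files), total_bytes, split, dataset_dir))
    return files
```

Calling report() with the same values passed to --dataset_dir and --vis_split shows whether any input matches. An empty result would mean the visualization job has nothing to read.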

If you have any ideas, I would greatly appreciate your help in getting me out of this slough.
