sbucaille commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[X] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[X] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/deeplab/vis.py
https://github.com/tensorflow/models/blob/master/research/deeplab/eval.py

2. Describe the bug

For a school project, I need to train a deep neural network with transfer learning to do segmentation on the PASCAL VOC 2009 dataset, and I chose DeepLabV3+. I adapted the installation steps to PASCAL VOC 2009, which was straightforward since it follows the same conventions as VOC 2012. Training worked fine, but when I evaluate my model I get an IoU of 0 on every class except class 0 (which I assume is the background). In addition, when I try to visualize the predictions, vis.py raises a KeyError that I can't find anyone else reporting.

3. Steps to reproduce

I adapted the download_and_convert_voc2012.sh file into a custom script, but since the dataset follows the same conventions, nothing really changed. Here is the code:

set -e

CURRENT_DIR=$(pwd)
WORK_DIR="./"
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
mkdir -p "${WORK_DIR}"
cd "${WORK_DIR}"

cd "${CURRENT_DIR}"

# Root path for the PASCAL VOC 2009 dataset (passed as the first argument).
PASCAL_ROOT="${WORK_DIR}/${1}"

# Remove the colormap in the ground truth annotations.
SEG_FOLDER="${PASCAL_ROOT}/SegmentationClass"
SEMANTIC_SEG_FOLDER="${PASCAL_ROOT}/SegmentationClassRaw"

echo "Removing the color map in ground truth annotations..."
python3 "${SCRIPT_DIR}/remove_gt_colormap.py" \
  --original_gt_folder="${SEG_FOLDER}" \
  --output_dir="${SEMANTIC_SEG_FOLDER}"

# Build TFRecords of the dataset.
# First, create output directory for storing TFRecords.
OUTPUT_DIR="${WORK_DIR}/tfrecord"
mkdir -p "${OUTPUT_DIR}"

IMAGE_FOLDER="${PASCAL_ROOT}/JPEGImages"
LIST_FOLDER="${PASCAL_ROOT}/ImageSets/Segmentation"

echo "Converting PASCAL VOC 2009 dataset..."
python3 "${SCRIPT_DIR}/build_voc2012_data.py" \
  --image_folder="${IMAGE_FOLDER}" \
  --semantic_segmentation_folder="${SEMANTIC_SEG_FOLDER}" \
  --list_folder="${LIST_FOLDER}" \
  --image_format="jpg" \
  --output_dir="${OUTPUT_DIR}"

This does produce correct TFRecords. I changed data_generator.py to include my own dataset information:

_PASCAL_VOC2009_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train' : 1049,
        'val' : 224,
        'test' : 226
    },
    num_classes=21,
    ignore_label=255
)
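
For context, here is a minimal sketch of how such a descriptor could be registered so that --dataset="pascal_2009" resolves to it. It assumes DatasetDescriptor is a plain namedtuple with the three fields used above and that datasets are looked up by name in a dictionary. The names DATASETS_INFORMATION and lookup_dataset are illustrative stand-ins, not copied from data_generator.py.

```python
import collections

# Assumed shape of the descriptor: a namedtuple with the three fields used above.
DatasetDescriptor = collections.namedtuple(
    "DatasetDescriptor", ["splits_to_sizes", "num_classes", "ignore_label"])

PASCAL_VOC2009_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={"train": 1049, "val": 224, "test": 226},
    num_classes=21,    # 20 object classes plus background
    ignore_label=255,  # pixels with this label are excluded from loss and metrics
)

# Hypothetical registry: maps the value of --dataset to its descriptor.
DATASETS_INFORMATION = {"pascal_2009": PASCAL_VOC2009_SEG_INFORMATION}


def lookup_dataset(name, split):
    """Return (descriptor, number of examples) for a dataset name and split."""
    if name not in DATASETS_INFORMATION:
        raise ValueError("Unknown dataset: %s" % name)
    descriptor = DATASETS_INFORMATION[name]
    if split not in descriptor.splits_to_sizes:
        raise ValueError("Unknown split: %s" % split)
    return descriptor, descriptor.splits_to_sizes[split]


descriptor, size = lookup_dataset("pascal_2009", "val")
print(descriptor.num_classes, size)
```

If the dataset name passed on the command line is not registered, or the split is missing from splits_to_sizes, a lookup of this kind fails early instead of silently reading the wrong files.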

I'm running this on Google Colab to use GPUs. All the data is where it is supposed to be, as well as the model checkpoint xception71_dpc_cityscapes_trainval from https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md.

I run the following command to train :

!python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=30000 \
    --train_split="train" \
    --model_variant="xception_71" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --fine_tune_batch_norm=False \
    --decoder_output_stride=4 \
    --train_crop_size="513,513" \
    --train_batch_size=4 \
    --dataset="pascal_2009" \
    --tf_initial_checkpoint="/content/models/research/deeplab/datasets/trainval_fine/model.ckpt.index" \
    --train_logdir="/content/drive/My Drive/exp_transfer/train_on_train_set/train" \
    --dataset_dir="/content/models/research/deeplab/datasets/tfrecord"

This command works fine, and I get my checkpoint, which I use for evaluation and visualization.

Evaluation: I run the following command:

!python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_71" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size="513,513" \
    --dataset="pascal_2009" \
    --eval_batch_size=1 \
    --checkpoint_dir="/content/drive/My Drive/exp_scratch/train_on_train_set/train" \
    --eval_logdir="/content/drive/My Drive/exp_scratch/train_on_train_set/eval" \
    --dataset_dir="/content/models/research/deeplab/datasets/tfrecord" \
    --max_number_of_evaluations=1

But here are the results :

eval/miou_1.0_class_15[0]
eval/miou_1.0_class_1[0]
eval/miou_1.0_class_10[0]
eval/miou_1.0_class_8[0]
eval/miou_1.0_class_17[0]
eval/miou_1.0_class_20[0]
eval/miou_1.0_class_0[0.805149138]
eval/miou_1.0_class_3[0]
eval/miou_1.0_class_7[0]
eval/miou_1.0_class_14[0]
eval/miou_1.0_class_18[0]
eval/miou_1.0_class_16[0]
eval/miou_1.0_class_19[0]
eval/miou_1.0_class_4[0]
eval/miou_1.0_overall[0.0383404382]
eval/miou_1.0_class_2[0]
eval/miou_1.0_class_9[0]
eval/miou_1.0_class_12[0]
eval/miou_1.0_class_6[0]
eval/miou_1.0_class_13[0]
eval/miou_1.0_class_11[0]
eval/miou_1.0_class_5[0]

Here the IoU is 0 for every class other than 0 (which I assume to be the background).
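
The overall figure is consistent with a plain average over the 21 classes in which only class 0 contributes. A small check, assuming the overall mIoU is the unweighted mean of the per-class values listed above (the variable names are illustrative):

```python
NUM_CLASSES = 21

# Per-class IoU as reported by eval.py: only class 0 is non-zero.
per_class_iou = [0.0] * NUM_CLASSES
per_class_iou[0] = 0.805149138

# Unweighted mean over all classes.
overall_miou = sum(per_class_iou) / NUM_CLASSES
print(round(overall_miou, 4))  # 0.0383
```

This matches the reported eval/miou_1.0_overall of 0.0383404382 up to single-precision rounding, which suggests the model predicts class 0 for essentially every pixel.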

Then visualization command :

!python deeplab/vis.py \
    --logtostderr \
    --vis_split="test" \
    --model_variant="xception_71" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --vis_crop_size="513,513" \
    --dataset="pascal_2009" \
    --vis_batch_size=1 \
    --colormap_type="pascal" \
    --checkpoint_dir="/content/drive/My Drive/exp_scratch/train_on_train_set/train" \
    --vis_logdir="/content/drive/My Drive/exp_scratch/train_on_train_set/vis" \
    --dataset_dir="/content/models/research/deeplab/datasets/tfrecord" \
    --max_number_of_iterations=1

And here I get this error:

Traceback (most recent call last):
  File "deeplab/vis.py", line 327, in <module>
    tf.app.run()
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "deeplab/vis.py", line 228, in main
    samples = dataset.get_one_shot_iterator().get_next()
  File "/content/models/research/deeplab/datasets/data_generator.py", line 339, in get_one_shot_iterator
    .map(self._preprocess_image, num_parallel_calls=self.num_readers))
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/dataset_ops.py", line 1913, in map
    self, map_func, num_parallel_calls, preserve_cardinality=False))
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/dataset_ops.py", line 3472, in __init__
    use_legacy_function=use_legacy_function)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/dataset_ops.py", line 2713, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py", line 1853, in _get_concrete_function_internal
    *args, **kwargs)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/dataset_ops.py", line 2707, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/dataset_ops.py", line 2652, in _wrapper_helper
    ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.pyct.errors.KeyError: in converted code:

    /content/models/research/deeplab/datasets/data_generator.py:295 _preprocess_image
        label = sample[common.LABELS_CLASS]

    KeyError: 'labels_class'

4. Expected behavior

A clear and concise description of what you expected to happen.

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
TensorFlow installed from (source or binary):
TensorFlow version (use command below): tf.version.VERSION = 1.15.2
Python version: python version: 3.6.9
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
/usr/local/lib/python2.7/dist-packages/torch/lib/libcudart-1b201d85.so.10.1
/usr/local/lib/python3.6/dist-packages/torch/lib/libcudart-1b201d85.so.10.1
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-10.1/doc/man/man7/libcudart.so.7
/usr/local/cuda-10.1/doc/man/man7/libcudart.7
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130
/usr/local/cuda-10.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-10.0/doc/man/man7/libcudart.7
GPU model and memory:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

mrheffels commented 3 years ago

Hi @sbucaille , not sure if this is still relevant to you but here goes.

Believe it or not, I had the exact same error code and I found out that the error actually comes from the naming of the split. To make it more clear, I found out about this because I created two new splits, "trainval" and "test". Trainval was working fine, but test wasn't.

There is a FILE_PATTERN expression which apparently causes any split with an 's' in there to fail. When I changed the split from 'test' to 'tet' it actually worked fine. I have to say that this worked for me on the eval.py script, I'm not sure about the vis.py script. Hope it helps you out.
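
One mechanism that would produce exactly this behaviour ("test" fails, "tet" and "trainval" work) is a data pipeline that treats the literal split name "test" as unlabeled and therefore never adds the label entry to the sample. The sketch below illustrates that idea only. It is an assumption, not something confirmed in this thread, and the names TEST_SET, parse_example and preprocess are illustrative rather than copied from data_generator.py.

```python
# Assumption: the split name "test" is reserved for data without ground truth.
TEST_SET = "test"
LABELS_CLASS = "labels_class"


def parse_example(split_name, record):
    """Build a sample dict; the label is skipped for the reserved test split."""
    sample = {"image": record["image"]}
    if split_name != TEST_SET:
        sample[LABELS_CLASS] = record["label"]
    return sample


def preprocess(sample):
    """Unconditionally reads the label, so it fails when the key is absent."""
    return sample["image"], sample[LABELS_CLASS]


record = {"image": "pixels", "label": "mask"}

for split in ("trainval", "tet", "test"):
    try:
        preprocess(parse_example(split, record))
        print(split, "ok")
    except KeyError as err:
        print(split, "KeyError:", err)
```

Under that assumption, renaming the split avoids the error because the name no longer equals the reserved value, not because of any particular character in it.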

saramsv commented 3 years ago

@mrheffels Thank you so much! That actually fixed my problem! I wouldn't have thought that the split name might be the source of the issue :)

tensorflow / models

[Deeplab] PASCALVOC2009 dataset 0 MIOU on classes other than 0 + can't visualize KeyError: 'labels_class' #8507