tensorflow / models

Models and examples built with TensorFlow

Object detection validation very slow on custom dataset #6106

Open KuribohG opened 5 years ago

KuribohG commented 5 years ago

System information

Describe the problem

Although I set num_examples to exactly the same value as for the pet dataset, evaluation during training is extremely slow. After INFO:tensorflow:Done running local_init_op. appears, it gets stuck for a long time, and the Evaluate annotation type *bbox* step takes about four hours. Is there any way to reduce the evaluation time?

Source code / logs

The eval part of train.config:

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 1101
}

Logs:

INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2961: models/model.ckpt-2961
INFO:tensorflow:Saving checkpoints for 2962 into models/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-28-02:35:41
INFO:tensorflow:Graph was finalized.
2019-01-28 10:35:43.311144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 10:35:43.311247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 10:35:43.311260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-28 10:35:43.311267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-28 10:35:43.311647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) ->
 physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:0c:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from models/model.ckpt-2962
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Ignoring ground truth with image id 1823103654 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1823103654 since it was previously added
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:DONE (t=28.37s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=13277.25s).
Accumulating evaluation results...
DONE (t=179.29s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.097
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.275
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.047
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.128
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.009
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.068
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.187
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.249
INFO:tensorflow:Finished evaluation at 2019-01-28-11:17:10
INFO:tensorflow:Saving dict for global step 2962: DetectionBoxes_Precision/mAP = 0.097453624, DetectionBoxes_Precision/mAP (large) = 0.12816702, DetectionBoxes_Precision/mAP (medium
) = 0.004950495, DetectionBoxes_Precision/mAP (small) = 0.0, DetectionBoxes_Precision/mAP@.50IOU = 0.27519244, DetectionBoxes_Precision/mAP@.75IOU = 0.046542585, DetectionBoxes_Reca
ll/AR@1 = 0.009249808, DetectionBoxes_Recall/AR@10 = 0.06791596, DetectionBoxes_Recall/AR@100 = 0.18733716, DetectionBoxes_Recall/AR@100 (large) = 0.2487546, DetectionBoxes_Recall/A
R@100 (medium) = 0.0002144508, DetectionBoxes_Recall/AR@100 (small) = 0.0, Loss/BoxClassifierLoss/classification_loss = 0.66864246, Loss/BoxClassifierLoss/localization_loss = 2.3871
467, Loss/RPNLoss/localization_loss = 1.0954195, Loss/RPNLoss/objectness_loss = 0.3143284, Loss/total_loss = 4.4655395, global_step = 2962, learning_rate = 1e-04, loss = 4.4655395
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2962: models/model.ckpt-2962
INFO:tensorflow:Saving checkpoints for 2963 into models/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-28-11:17:23
INFO:tensorflow:Graph was finalized.
2019-01-28 19:17:24.259390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 19:17:24.259465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 19:17:24.259475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-28 19:17:24.259482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-28 19:17:24.259647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) ->
 physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:0c:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from models/model.ckpt-2963
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Ignoring ground truth with image id 1823103654 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1823103654 since it was previously added
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:DONE (t=27.72s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=11867.54s).
Accumulating evaluation results...
DONE (t=162.59s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.097
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.275
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.046
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.128
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.009
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.068
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.187
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.248
INFO:tensorflow:Finished evaluation at 2019-01-28-19:37:58
INFO:tensorflow:Saving dict for global step 2963: DetectionBoxes_Precision/mAP = 0.09728048, DetectionBoxes_Precision/mAP (large) = 0.12794043, DetectionBoxes_Precision/mAP (medium)
 = 0.004950495, DetectionBoxes_Precision/mAP (small) = 0.0, DetectionBoxes_Precision/mAP@.50IOU = 0.27511117, DetectionBoxes_Precision/mAP@.75IOU = 0.045991823, DetectionBoxes_Recal
l/AR@1 = 0.009229408, DetectionBoxes_Recall/AR@10 = 0.06773527, DetectionBoxes_Recall/AR@100 = 0.18692453, DetectionBoxes_Recall/AR@100 (large) = 0.24820603, DetectionBoxes_Recall/A
R@100 (medium) = 0.00021755317, DetectionBoxes_Recall/AR@100 (small) = 0.0, Loss/BoxClassifierLoss/classification_loss = 0.666948, Loss/BoxClassifierLoss/localization_loss = 2.38650
25, Loss/RPNLoss/localization_loss = 1.0936925, Loss/RPNLoss/objectness_loss = 0.31457162, Loss/total_loss = 4.4617558, global_step = 2963, learning_rate = 1e-04, loss = 4.4617558
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2963: models/model.ckpt-2963
INFO:tensorflow:Saving checkpoints for 2964 into models/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-28-19:38:17
INFO:tensorflow:Graph was finalized.
2019-01-29 03:38:18.362373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-29 03:38:18.362449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-29 03:38:18.362486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-29 03:38:18.362495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-29 03:38:18.362718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) ->
 physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:0c:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from models/model.ckpt-2964
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Ignoring ground truth with image id 1823103654 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1823103654 since it was previously added
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:DONE (t=19.12s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*

Maybe it is because the evaluation step takes too long; it evaluates at every training iteration.

And here is my dataset generation script:

import PIL.Image
import tensorflow as tf
import argparse
import hashlib
import io
import logging
import os
from lxml import etree
import random
from tqdm import tqdm
import contextlib2

from object_detection.utils import dataset_util
from object_detection.utils import label_map_util
from object_detection.dataset_tools import tf_record_creation_util

PBRS_ROOT = '/mnt/disk3/zzz/pbrs'
LABEL_MAP_PATH = '/mnt/disk3/zzz/pbrs/processed/2d_det/label_map.pbtxt'

parser = argparse.ArgumentParser()
parser.add_argument("-o", "--output_path", default="/mnt/disk3/zzz/pbrs/processed/2d_det", help="Path to output TFRecord")
args = parser.parse_args()

def create_tf_example(img_path, bbox_path, label_map_dict):
    with tf.gfile.GFile(img_path, 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = PIL.Image.open(encoded_jpg_io)  # opened but unused; size is hardcoded below
    key = hashlib.sha256(encoded_jpg).hexdigest()

    width, height = 640, 480  # all PBRS renders are assumed to be 640x480

    xmin = []
    ymin = []
    xmax = []
    ymax = []
    classes = []
    classes_text = []

    # Each bbox line holds integer pixel coordinates; the indexing below
    # assumes the format (id, ymin, xmin, ymax, xmax).
    with open(bbox_path) as f:
        for line in f:
            p = list(map(int, line.split()))
            xmin.append(float(p[2]) / width)
            ymin.append(float(p[1]) / height)
            xmax.append(float(p[4] + 1) / width)
            ymax.append(float(p[3] + 1) / height)
            classes_text.append('model'.encode('utf8'))
            classes.append(label_map_dict['model'])

    example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(img_path.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(img_path.encode('utf8')),
        'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmin),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmax),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymin),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymax),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return example

def create_tf_record(output_path, image_list, label_map_dict, num_shards):
    with contextlib2.ExitStack() as tf_record_close_stack:
        output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
            tf_record_close_stack, output_path, num_shards)
        for idx, image in enumerate(tqdm(image_list)):
            bbox_path = os.path.join(PBRS_ROOT, '2d_bbox', image[0], '{}.txt'.format(image[1]))
            img_path = os.path.join(PBRS_ROOT, 'opengl_v2', image[0], '{}_color.jpg'.format(image[1]))
            tf_example = create_tf_example(img_path, bbox_path, label_map_dict)
            output_shard_index = idx % num_shards
            output_tfrecords[output_shard_index].write(tf_example.SerializeToString())

def main():
    train_path = os.path.join(args.output_path, 'pbrs_2ddet_train.record')
    val_path = os.path.join(args.output_path, 'pbrs_2ddet_val.record')
    label_map_dict = label_map_util.get_label_map_dict(LABEL_MAP_PATH)

    house_id_list = os.listdir(os.path.join(PBRS_ROOT, 'node_v2'))
    image_list = []
    for house_id in house_id_list:
        camera_list = os.listdir(os.path.join(PBRS_ROOT, 'node_v2', house_id))
        camera_list = list(map(lambda x: x[:6], camera_list))
        for camera in camera_list:
            image_list.append((house_id, camera))
    random.seed(42)
    random.shuffle(image_list)

    num_examples = len(image_list)
    num_train = int(0.9 * num_examples)
    create_tf_record(train_path, image_list[:num_train], label_map_dict, num_shards=100)
    create_tf_record(val_path, image_list[num_train:], label_map_dict, num_shards=10)

if __name__ == '__main__':
    main()

There are about 500,000 images in my training set and 50,000 in the validation set.

tienduchoang commented 5 years ago

Hello KuribohG, Did you fix this problem?

KuribohG commented 5 years ago

Hello KuribohG, Did you fix this problem?

No, I have migrated to maskrcnn-benchmark. I found that in the first version of my dataset, too many bounding boxes had been wrongly added to the tfrecord file. But even after removing them, the problem remained unsolved.
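
For anyone chasing the same "Ignoring ground truth with image id ... since it was previously added" warnings: one common cause is duplicated image/source_id values in the record. Here is a minimal TF1-style sketch to count them, with the path taken from the generation script above:

import tensorflow as tf
from collections import Counter

# Tally every image/source_id across the sharded validation record; any id
# seen more than once is dropped by the evaluator with the warning above.
counts = Counter()
val_pattern = '/mnt/disk3/zzz/pbrs/processed/2d_det/pbrs_2ddet_val.record-*'
for shard in tf.gfile.Glob(val_pattern):
    for record in tf.python_io.tf_record_iterator(shard):
        example = tf.train.Example.FromString(record)
        source_id = example.features.feature['image/source_id'].bytes_list.value[0]
        counts[source_id] += 1

duplicates = {k: v for k, v in counts.items() if v > 1}
print('{} source_ids appear more than once'.format(len(duplicates)))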

auslaner commented 5 years ago

Maybe it is because the evaluation step takes too long; it evaluates at every training iteration.

I can at least help with this part.

You need to add the throttle_secs parameter to the EvalSpec in object_detection/model_lib.py. With a value of 18000, it will only try to evaluate every 5 hours, so you'll at least get 1 hour of training in between your evals if they take 4 hours to complete.

eval_specs.append(
        tf.estimator.EvalSpec(
            name=eval_spec_name,
            input_fn=eval_input_fn,
            steps=None,
            exporters=exporter,
            throttle_secs=18000))
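
(For context: throttle_secs is a lower bound, not a schedule; tf.estimator.train_and_evaluate only starts a new evaluation when a fresh checkpoint exists and at least that many seconds have passed since the previous evaluation began.)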

But it seems like the real issue here is the length of time the evals are taking.

thusinh1969 commented 5 years ago

With a large dataset, it is advisable to keep max_evals = 1 so that training goes through ALL samples at least once before doing any validation. You can set eval_interval_secs=3600 as well, provided it takes less than 1 hour to finish one full epoch. After that you can Ctrl-C, change max_evals and eval_interval_secs to whatever you want, and evaluate more frequently.
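
For reference, both knobs are plain eval_config fields (names per object_detection/protos/eval.proto), so a sketch extending the config from the issue would look like this; note they are honored by the legacy train.py/eval.py flow:

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 1101
  eval_interval_secs: 3600  # wait at least an hour between evaluations
  max_evals: 1              # stop after a single evaluation pass
}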

Steve

zishanahmed08 commented 5 years ago

any updates on this?

wojiaohumaocheng commented 5 years ago

Did you solve this problem? I'm facing the same problem.

chanwkaa commented 4 years ago

I'm facing the same problem. Has anyone solved this?

ssbagalkar commented 4 years ago

Same issue!

wl4135 commented 4 years ago

the same issue!!

timakov-dmitry commented 4 years ago

same issue (oid)

canhld94 commented 4 years ago

same issue here (custom dataset)

jonpsy commented 3 years ago

Same issue, keep alive.

xdhmoore commented 3 years ago

I'm still piecing together how to do this myself, but I found this in input_reader.proto:

  // Integer representing how often an example should be sampled. To feed
  // only 1/3 of your data into your model, set `sample_1_of_n_examples` to 3.
  // This is particularly useful for evaluation, where you might not prefer to
  // evaluate all of your samples
  optional uint32 sample_1_of_n_examples = 22 [default = 1];

It looks like this just uses Dataset.shard under the hood, so the result (when set to 3) will be all the data items whose index mod 3 == 0.
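
A quick way to see the shard semantics outside the Object Detection API (a minimal sketch, assuming TF2 eager execution):

import tensorflow as tf

# Dataset.shard(num_shards, index) keeps the elements whose position satisfies
# position % num_shards == index, so shard(3, 0) keeps indices 0, 3, 6, ...
dataset = tf.data.Dataset.range(10).shard(num_shards=3, index=0)
print(list(dataset.as_numpy_iterator()))  # [0, 3, 6, 9]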

I haven't tried this yet, but it appears to be a possible way to limit eval size when num_eval_steps and num_examples are not working.
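
Untested, but since sample_1_of_n_examples is an InputReader field it would presumably go in the eval input reader block of the pipeline config, something like this (paths hypothetical, shard pattern matching the generation script above):

eval_input_reader: {
  label_map_path: "label_map.pbtxt"
  sample_1_of_n_examples: 3
  tf_record_input_reader {
    input_path: "pbrs_2ddet_val.record-?????-of-00010"
  }
}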

ASMIftekhar commented 3 years ago

Same issue, keep alive.