Object Detection: cannot finetune 10 classes or less by using the model ssd_resnet_50_fpn_coco #5028

Open highfly22 opened 6 years ago

highfly22 commented 6 years ago

System information

    python object_detection/model_main.py\
    --logtostderr \
    --pipeline_config_path=${DIR}/logo-detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640.config \
model {
  ssd {
    num_classes: 10
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    feature_extractor {
      type: "ssd_resnet50_v1_fpn"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 0.000399999989895
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.0299999993294
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019
          scale: true
          epsilon: 0.0010000000475
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 0.000399999989895
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.00999999977648
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019
            scale: true
            epsilon: 0.0010000000475
          }
        }
        depth: 256
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.59999990463
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 1
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.0399999991059
          total_steps: 50000
          warmup_learning_rate: 0.0133330002427
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.899999976158
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "/opt/ml/data/logo-detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  fine_tune_checkpoint_type: "detection"
  # num_steps: 25000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}
train_input_reader {
  label_map_path: "/opt/ml/data/logo-detection/logo-label-map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/data/logo-detection/dataset-train.tfrecord"
  }
}
eval_config {
  num_examples: 8000
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "/opt/ml/data/logo-detection/logo-label-map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "/opt/ml/data/logo-detection/dataset-val.tfrecord"
  }
}

Describe the problem

At commit 02a9969e94feb51966f9bacddc1836d811f8ce69, I am trying to fine-tune ssd_resnet_50_fpn_coco for 10-class object detection. Training fails with the InvalidArgumentError shown in the logs below.

Source code / logs

2018-08-08 03:26:17.852738: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at iterator_ops.cc:891 : Invalid argument: indices[2] = 2 is not in [0, 2)
     [[Node: Gather_4 = Gather[Tindices=DT_INT64, Tparams=DT_INT64, validate_indices=true](cond/Merge, Reshape_8)]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[2] = 2 is not in [0, 2)
     [[Node: Gather_4 = Gather[Tindices=DT_INT64, Tparams=DT_INT64, validate_indices=true](cond/Merge, Reshape_8)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,640,640,3], [1,3], [1,100], [1,100,4], [1,100,10], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
     [[Node: IteratorGetNext/_3859 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_669_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
tensorflowbutler commented 6 years ago

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

pkulzc commented 6 years ago

Did this work in the past? Could you please sync to HEAD and try again?

highfly22 commented 6 years ago

It works when the number of classes is 15, but it fails when the number of classes is 10 or fewer.

I have tested on HEAD. The result is the same.

kareem1925 commented 6 years ago

Just a simple question:

Does your evaluation data actually have 8000 samples? A mismatch with num_examples may be what causes the error.

I'm just guessing.

eval_config {
  num_examples: 8000
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

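
For anyone who wants to check this guess, here is a minimal sketch that counts the records in a TFRecord file without importing TensorFlow. It relies on the documented TFRecord framing (a uint64 length, a uint32 CRC of the length, the payload, then a uint32 CRC of the payload) and skips CRC verification. The function name `count_tfrecords` is hypothetical and not part of the Object Detection API.

```python
import struct


def count_tfrecords(path):
    """Count the records in a TFRecord file by walking its framing.

    Each record is: uint64 payload length, uint32 CRC of the length,
    the payload bytes, uint32 CRC of the payload. The CRCs are read
    but NOT verified (assumption: the file is not corrupted).
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            # Length CRC (4 bytes) + payload + payload CRC (4 bytes).
            rest = f.read(4 + length + 4)
            if len(rest) < 4 + length + 4:
                raise ValueError("truncated record body")
            count += 1
    return count
```

Calling `count_tfrecords` on the file given as `input_path` in `eval_input_reader` and comparing the result with `num_examples` would confirm or rule out a mismatch. Whether such a mismatch can trigger the error above is not established in this thread.
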
AVCarreiro commented 6 years ago

I can get it to work with a single class, but although the loss is smoothly decreasing, it outputs no boxes at all, resulting in 0% precision and recall (at least up to 5000 steps, as I'm still running). Warm-up is only 2000 steps.
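
To put the warm-up in context, here is a rough sketch of the learning-rate schedule implied by the `cosine_decay_learning_rate` block in the original config. It uses the common formulation of linear warm-up followed by cosine decay. That this matches the API's implementation exactly is an assumption, and the function name is made up for illustration.

```python
import math

# Values taken from the config in the original post (rounded).
LEARNING_RATE_BASE = 0.04
TOTAL_STEPS = 50000
WARMUP_LEARNING_RATE = 0.013333
WARMUP_STEPS = 2000


def learning_rate_at(step,
                     base=LEARNING_RATE_BASE,
                     total_steps=TOTAL_STEPS,
                     warmup_lr=WARMUP_LEARNING_RATE,
                     warmup_steps=WARMUP_STEPS):
    """Linear warm-up to `base`, then cosine decay to zero at `total_steps`.

    This is the usual textbook formulation, assumed (not verified) to
    mirror what the Object Detection API does.
    """
    if step < warmup_steps:
        slope = (base - warmup_lr) / warmup_steps
        return warmup_lr + slope * step
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * base * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1000, 2000, 5000, 25000, 50000):
        print(s, learning_rate_at(s))
```

Under this formulation the rate at step 5000 is still close to its peak, so a model that outputs no boxes by that point has spent most of its training near the highest learning rate in the schedule.
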

MLikelihood commented 5 years ago

I encountered the same problem. Does one-class object detection not work with any of SSD, RetinaNet, or Faster R-CNN?

BalajiB3663 commented 5 years ago

I am facing the same issue and have been working on it for the past 2 months. Has anybody found a fix? My custom dataset has 9 classes. When I train with the faster rcnn resnet101 model it works, but when I train with the ssd resnet50 fpn model it does not detect anything at all.

pkulzc commented 5 years ago

For all people who commented here, what is your situation?

  1. facing the same error (something like "indices[2] = 2 is not in [0, 2)")
  2. your transfer learning can run but doesn't perform well

These are two different issues, and the first one should have been fixed by earlier pull requests.

If you are still facing issue 1, please file a separate issue with detailed config and error log info. If you are facing issue 2, please tune your parameters (e.g. train for more steps, or use a higher/lower learning rate).
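
For readers unsure what the message in issue 1 means: it is the bounds check of a gather operation, which was asked for position 2 of a tensor that only has 2 elements (valid positions 0 and 1). Here is a pure-Python illustration of that check. It is not TensorFlow's implementation, and it says nothing about which tensors in the input pipeline were involved.

```python
def gather(params, indices):
    """Return [params[i] for i in indices], with a bounds check.

    Illustration only: raises an error worded like the one in the log
    when an index falls outside [0, len(params)).
    """
    out = []
    for pos, idx in enumerate(indices):
        if not 0 <= idx < len(params):
            raise IndexError(
                f"indices[{pos}] = {idx} is not in [0, {len(params)})")
        out.append(params[idx])
    return out


if __name__ == "__main__":
    print(gather([10, 20], [0, 1]))   # both indices are in range
    try:
        gather([10, 20], [0, 1, 2])   # the third index is out of range
    except IndexError as err:
        print(err)
```
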

wjsakfh commented 5 years ago

Has anyone solved this problem yet? Please let me know.

thusinh1969 commented 5 years ago

I face the same problem. During training, no bbox/class is detected at all, or only very rarely. I used the original model from the zoo for transfer learning with the TensorFlow Object Detection API on a custom dataset of 70,000 images with only 2 classes. Training on the same dataset with ResNet101 gives good results.

Please help. Steve

tensorflowbutler commented 4 years ago

matiasSaavedra commented 4 years ago

I still have the problem. No bbox/class, mAP ~10e-4. Loss is decreasing.
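
One thing worth checking for the "no boxes" symptom, offered as a possible contributing factor and not a diagnosis: the config in the original post sets `class_prediction_bias_init` to about -4.6 and `score_threshold` to about 0.3 with `score_converter: SIGMOID`. The sketch below assumes the class head is freshly initialized with logits close to that bias. Under that assumption, every class score starts near 0.01, and post-processing drops every box until training pushes some scores above the threshold.

```python
import math


def sigmoid(x):
    """Logistic function, as used by score_converter: SIGMOID."""
    return 1.0 / (1.0 + math.exp(-x))


def bias_for_prior(prior):
    """Bias that makes sigmoid(bias) equal to `prior` (focal-loss style init)."""
    return -math.log((1.0 - prior) / prior)


# Values from the config in the original post (rounded).
CLASS_PREDICTION_BIAS_INIT = -4.6
SCORE_THRESHOLD = 0.3

# Assumption: logits of a freshly initialized class head sit near the bias.
initial_score = sigmoid(CLASS_PREDICTION_BIAS_INIT)
passes_threshold = initial_score >= SCORE_THRESHOLD

if __name__ == "__main__":
    print("initial score:", initial_score)
    print("bias for a 0.01 prior:", bias_for_prior(0.01))
    print("kept by post-processing:", passes_threshold)
```

Temporarily lowering `score_threshold` for evaluation would show whether the model produces low-confidence boxes in roughly the right places, which would point to undertraining and away from a broken pipeline.
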