Object Detection: cannot finetune 10 classes or less by using the model ssd_resnet_50_fpn_coco

highfly22 commented 6 years ago

System information

Linux Ubuntu 16.04
TensorFlow installed from anaconda
TensorFlow 1.7
CUDA 9.0/cuDNN 7
P40
command

    python object_detection/model_main.py \
    --logtostderr \
    --pipeline_config_path=${DIR}/logo-detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640.config \
    --model_dir=${OUTPUT_DIR}

model {
  ssd {
    num_classes: 10
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    feature_extractor {
      type: "ssd_resnet50_v1_fpn"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 0.000399999989895
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.0299999993294
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019
          scale: true
          epsilon: 0.0010000000475
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 0.000399999989895
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.00999999977648
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019
            scale: true
            epsilon: 0.0010000000475
          }
        }
        depth: 256
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.59999990463
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 1
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.0399999991059
          total_steps: 50000
          warmup_learning_rate: 0.0133330002427
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.899999976158
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "/opt/ml/data/logo-detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  fine_tune_checkpoint_type:  "detection"
  # num_steps: 25000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}
train_input_reader {
  label_map_path: "/opt/ml/data/logo-detection/logo-label-map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/data/logo-detection/dataset-train.tfrecord"
  }
}
eval_config {
  num_examples: 8000
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "/opt/ml/data/logo-detection/logo-label-map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "/opt/ml/data/logo-detection/dataset-val.tfrecord"
  }
}

Describe the problem

At commit 02a9969e94feb51966f9bacddc1836d811f8ce69, I am trying to fine-tune ssd_resnet_50_fpn_coco for 10-class object detection. Training fails with the error below.

Source code / logs

2018-08-08 03:26:17.852738: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at iterator_ops.cc:891 : Invalid argument: indices[2] = 2 is not in [0, 2)
     [[Node: Gather_4 = Gather[Tindices=DT_INT64, Tparams=DT_INT64, validate_indices=true](cond/Merge, Reshape_8)]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[2] = 2 is not in [0, 2)
     [[Node: Gather_4 = Gather[Tindices=DT_INT64, Tparams=DT_INT64, validate_indices=true](cond/Merge, Reshape_8)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[1], [1,640,640,3], [1,3], [1,100], [1,100,4], [1,100,10], [1,100], [1,100], [1,100], [1]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
     [[Node: IteratorGetNext/_3859 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_669_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
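
The thread does not establish what triggers this Gather failure in the input pipeline. One cheap thing to rule out, assuming the usual convention that label map IDs run from 1 to num_classes (0 is reserved for background), is a mismatch between the label map and the num_classes value in the config. The sketch below is only an illustration: check_label_ids is a hypothetical helper, and it parses the .pbtxt text with a regular expression rather than using the Object Detection API's own loader.

```python
import re


def check_label_ids(label_map_text, num_classes):
    """Return the label IDs that fall outside 1..num_classes.

    Hypothetical helper: scans the label map text for `id: N` entries.
    """
    ids = [int(m) for m in re.findall(r"\bid\s*:\s*(\d+)", label_map_text)]
    return sorted(i for i in ids if not 1 <= i <= num_classes)


# Made-up label map for illustration; replace with the contents of your own file.
example = """
item { id: 1 name: 'logo_a' }
item { id: 2 name: 'logo_b' }
item { id: 11 name: 'logo_c' }
"""

out_of_range = check_label_ids(example, num_classes=10)
print("IDs outside 1..num_classes:", out_of_range)
```

An empty result only shows that the label map is consistent with num_classes; the class IDs stored in the TFRecord files would still need a separate check.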

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

pkulzc commented 6 years ago

Did this work in the past? Could you please sync to HEAD and try again?

highfly22 commented 6 years ago

It worked when the number of classes was 15, but it fails when the number of classes is 10 or fewer.

I have also tested on HEAD. The result is the same.

kareem1925 commented 6 years ago

Just a simple question:

Does your evaluation data have 8000 samples? Maybe this is what causes the error.

I'm just guessing.

eval_config {
  num_examples: 8000
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
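
To answer that question directly, the records in the validation file can be counted without loading TensorFlow. This is a minimal sketch, assuming an uncompressed TFRecord file, where each record is framed as an 8-byte little-endian length, a 4-byte CRC, the payload, and another 4-byte CRC. count_tfrecord_records is a hypothetical helper name, and the CRCs are skipped rather than verified.

```python
import struct


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file by walking its framing.

    Hypothetical helper: CRCs are skipped, not verified, so a file whose
    last record is truncated will be over-counted by one.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, 1)
            count += 1
    return count
```

Calling count_tfrecord_records on the dataset-val.tfrecord path from the config and comparing the result with num_examples would show whether the two agree.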

AVCarreiro commented 6 years ago

I can get it to work with a single class, but although the loss is smoothly decreasing, it outputs no boxes at all, resulting in 0% precision and recall (at least up to 5000 steps, as I'm still running). Warm-up is only 2000 steps.
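
As a rough sanity check on the "no boxes" symptom, the posted config uses score_converter: SIGMOID, class_prediction_bias_init of about -4.6, and a score_threshold of about 0.3. The arithmetic below assumes a class logit starts out near the bias value; it shows how far the logit has to move before a detection passes the threshold, and it does not establish that this is the cause.

```python
import math


def sigmoid(x):
    """Logistic function, the same mapping a SIGMOID score converter applies."""
    return 1.0 / (1.0 + math.exp(-x))


CLASS_PREDICTION_BIAS_INIT = -4.6  # class_prediction_bias_init in the posted config
SCORE_THRESHOLD = 0.3              # score_threshold in the posted config

# Score produced when the class logit equals the initial bias (an assumption).
initial_score = sigmoid(CLASS_PREDICTION_BIAS_INIT)

# Logit at which the sigmoid score reaches the threshold (inverse sigmoid).
logit_needed = math.log(SCORE_THRESHOLD / (1.0 - SCORE_THRESHOLD))

print(f"score at bias init: {initial_score:.4f}")
print(f"logit needed to reach the threshold: {logit_needed:.3f}")
```

Temporarily lowering score_threshold for evaluation is one way to see whether low-confidence boxes are being produced at all.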

MLikelihood commented 5 years ago

I encountered the same problem. Does one-class object detection not work with any of SSD, RetinaNet, or Faster R-CNN?

BalajiB3663 commented 5 years ago

I am facing the same issue and have been working on it for the past 2 months. Has anybody found a fix? My custom dataset has 9 classes. When I train a Faster R-CNN ResNet-101 model it works, but when I train the SSD ResNet-50 FPN model it does not detect anything at all.

pkulzc commented 5 years ago

For all people who commented here, what is your situation?

1. You are facing the same error (something like "indices[2] = 2 is not in [0, 2)").
2. Your transfer learning runs but doesn't perform well.

These are two different issues, and the first one should have been fixed by earlier pull requests.

If you are still facing issue 1, please file a separate issue with a detailed config and error log. If you are facing issue 2, please tune your parameters (e.g. train for more steps, or try higher/lower learning rates).
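
When tuning the learning rate for issue 2, it can help to see what the posted schedule does at a given step. The sketch below assumes a linear warm-up from warmup_learning_rate to learning_rate_base, followed by cosine decay to zero at total_steps. It is meant to approximate cosine_decay_learning_rate and is not the library's own code. The defaults are the rounded values from the config in this thread.

```python
import math


def cosine_decay_with_warmup(step, base_lr=0.04, total_steps=50000,
                             warmup_lr=0.013333, warmup_steps=2000):
    """Approximate learning rate at `step` (assumed schedule, see lead-in)."""
    if step < warmup_steps:
        # Linear ramp from warmup_lr up to base_lr.
        return warmup_lr + (base_lr - warmup_lr) * step / warmup_steps
    if step >= total_steps:
        return 0.0
    # Cosine decay from base_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 5000, 25000, 50000):
    print(step, round(cosine_decay_with_warmup(step), 5))
```

Note that the config was posted with batch_size: 1, so a base rate chosen for a much larger batch may be worth lowering as a first experiment.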

wjsakfh commented 5 years ago

Has anyone solved this problem yet? Please let me know.

thusinh1969 commented 5 years ago

I face the same problem. During training, NO BBOX/CLASS is detected AT ALL, or only very rarely! I used the original model from the zoo for transfer learning with the TensorFlow Object Detection API on a custom dataset of 70,000 images with only 2 classes. Training on the same dataset with ResNet101 gives good results.

ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03

Please help. Steve

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

matiasSaavedra commented 4 years ago

I still have the problem. No bbox/class, mAP ~10e-4. Loss is decreasing.
