
Problem in retraining ssd_mobilenet on my own dataset using TensorFlow object detection API #5385

Closed prasanth-ntu closed 4 years ago

prasanth-ntu commented 6 years ago

I have followed this tutorial (https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial) to retrain 'ssd_mobilenet_v1_coco' on my own dataset. I am trying to use at least one or more classes from my own dataset. I converted my dataset with annotated bounding boxes to the required TFRecord format without errors and updated the .config file as well. I also tried the dataset created by another user on GitHub (https://github.com/datitran/raccoon_dataset).
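For reference, each record was generated roughly along these lines. This is a simplified sketch of the TFRecord layout that the Object Detection API documents for custom datasets; the helper name `create_tf_example`, the placeholder filename, and the box values below are illustrative, not my actual data:

```python
import tensorflow as tf
from object_detection.utils import dataset_util


def create_tf_example(filename, encoded_jpg, width, height,
                      xmins, xmaxs, ymins, ymaxs, class_names, class_ids):
    # Box coordinates are lists of floats normalized to [0, 1] by image width/height;
    # class_ids must match the ids in the label map (starting from 1).
    return tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(filename.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(b'jpeg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(
            [name.encode('utf8') for name in class_names]),
        'image/object/class/label': dataset_util.int64_list_feature(class_ids),
    }))


# Illustrative usage: one image with a single 'raccoon' box (label id 1).
with tf.gfile.GFile('images/raccoon-1.jpg', 'rb') as f:  # placeholder path
    encoded_jpg = f.read()
example = create_tf_example('raccoon-1.jpg', encoded_jpg, 640, 480,
                            [0.1], [0.5], [0.2], [0.6], ['raccoon'], [1])
writer = tf.python_io.TFRecordWriter('data/train.record')
writer.write(example.SerializeToString())
writer.close()
```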


My server environment details:
- GPU: 4 × GeForce GTX 1080 Ti
- OS: Ubuntu 16.04.4 LTS
- TensorFlow GPU version: 1.9.0
- Running in a virtual environment
- CUDA version: 9.0.176


I then executed the command below to retrain 'ssd_mobilenet_v1_coco' on my own dataset: `python3 legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config`


What I observe is that the training runs for a seemingly random number of steps and then stops, throwing the error message below. I have attached a portion of the training log:

```
INFO:tensorflow:Restoring parameters from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
I0927 01:01:09.224665 139707445737216 tf_logging.py:115] Restoring parameters from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
INFO:tensorflow:Running local_init_op.
I0927 01:01:09.527872 139707445737216 tf_logging.py:115] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0927 01:01:09.858946 139707445737216 tf_logging.py:115] Done running local_init_op.
INFO:tensorflow:Starting Session.
I0927 01:01:18.386830 139707445737216 tf_logging.py:115] Starting Session.
INFO:tensorflow:Saving checkpoint to path training/model.ckpt
I0927 01:01:18.674590 139650089871104 tf_logging.py:115] Saving checkpoint to path training/model.ckpt
INFO:tensorflow:Starting Queues.
I0927 01:01:18.680459 139707445737216 tf_logging.py:115] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I0927 01:01:25.934769 139650014369536 tf_logging.py:159] global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
I0927 01:01:35.952289 139650081478400 tf_logging.py:115] Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 13.3887 (17.959 sec/step)
I0927 01:01:36.882163 139707445737216 tf_logging.py:115] global step 1: loss = 13.3887 (17.959 sec/step)
INFO:tensorflow:global step 2: loss = 12.4938 (0.381 sec/step)
I0927 01:01:37.541383 139707445737216 tf_logging.py:115] global step 2: loss = 12.4938 (0.381 sec/step)
INFO:tensorflow:global step 3: loss = 11.3832 (0.425 sec/step)
I0927 01:01:37.968488 139707445737216 tf_logging.py:115] global step 3: loss = 11.3832 (0.425 sec/step)
INFO:tensorflow:global step 4: loss = 10.6289 (0.469 sec/step)
I0927 01:01:38.439345 139707445737216 tf_logging.py:115] global step 4: loss = 10.6289 (0.469 sec/step)
INFO:tensorflow:global step 5: loss = 10.0338 (0.400 sec/step)
I0927 01:01:38.841859 139707445737216 tf_logging.py:115] global step 5: loss = 10.0338 (0.400 sec/step)
...
INFO:tensorflow:global step 16: loss = 6.6211 (0.420 sec/step)
I0927 01:22:40.314219 140712953894656 tf_logging.py:115] global step 16: loss = 6.6211 (0.420 sec/step)
```

```
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Incompatible shapes: [2,1917] vs. [3,1]
	 [[Node: Loss/Match_22/cond/mul_4 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match_22/cond/one_hot, Loss/Match_22/cond/Cast_2)]]
	 [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_5_depthwise/BatchNorm/AssignMovingAvg_1/mul/_3803 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge6517...gAvg_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
```

The shapes in `Incompatible shapes: [2,1917] vs. [3,1]` differ between datasets and keep changing from run to run. The training also stops after a different number of steps each time.

However, when I use the 'macncheese' dataset provided on the tutorial website, I am able to train beyond 10,000 steps without any issues. I have checked many other posts, but none of them helped me resolve the issue.
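Since the shape reported in the error varies with the dataset, one thing worth doing is a quick sanity check of the generated train.record. A rough TF 1.x sketch, assuming the standard Object Detection API feature keys shown above:

```python
import tensorflow as tf

# Print image size, class labels and x-extents of the boxes for the first few
# records to spot empty or malformed annotations.
for i, record in enumerate(tf.python_io.tf_record_iterator('data/train.record')):
    example = tf.train.Example()
    example.ParseFromString(record)
    feat = example.features.feature
    print(i,
          feat['image/width'].int64_list.value,
          feat['image/height'].int64_list.value,
          list(feat['image/object/class/label'].int64_list.value),
          list(feat['image/object/bbox/xmin'].float_list.value),
          list(feat['image/object/bbox/xmax'].float_list.value))
    if i >= 4:
        break
```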


I have visited the following links, but still couldn't resolve the issue:
[1] https://github.com/dennybritz/chatbot-retrieval/issues/15
[2] https://github.com/tensorflow/models/issues/1760
[3] https://github.com/balancap/SSD-Tensorflow/issues/88


This is my config:

```
model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity { }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer { weight: 0.00004 }
          }
          initializer {
            truncated_normal_initializer { stddev: 0.03 mean: 0.0 }
          }
          batch_norm { train: true, scale: true, center: true, decay: 0.9997, epsilon: 0.001, }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v1'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer { weight: 0.00004 }
        }
        initializer {
          truncated_normal_initializer { stddev: 0.03 mean: 0.0 }
        }
        batch_norm { train: true, scale: true, center: true, decay: 0.9997, epsilon: 0.001, }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid { anchorwise_output: true }
      }
      localization_loss {
        weighted_smooth_l1 { anchorwise_output: true }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip { }
  }
  data_augmentation_options {
    ssd_random_crop { }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "data/object-detection.pbtxt"
}

eval_config: {
  num_examples: 40
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/test.record"
  }
  label_map_path: "training/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}
```

This is my object-detection.pbtxt: `item { id: 1 name: 'raccoon' }`
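For a quick check that the label map parses and that its ids line up with `num_classes` in the config, something like the following should work (a small sketch, assuming the research/object_detection package is on the PYTHONPATH and the label map lives at the path the train_input_reader points to; `max_num_classes` is just an arbitrary upper bound):

```python
from object_detection.utils import label_map_util

# Parse the label map and list its (id, name) pairs; ids should start at 1 and
# the largest id should equal num_classes in the pipeline config.
label_map = label_map_util.load_labelmap('data/object-detection.pbtxt')
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=90, use_display_name=True)
for category in categories:
    print(category['id'], category['name'])
```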

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
- What is the top-level directory of the model you are using
- Have I written custom code
- OS Platform and Distribution
- TensorFlow installed from
- TensorFlow version
- Bazel version
- CUDA/cuDNN version
- GPU model and memory
- Exact command to reproduce

ymchen7 commented 6 years ago

Same error here, right after pulling the recent Object Detection API changes. It seems that 'legacy/train.py' is not compatible with the recent changes.

prasanth-ntu commented 6 years ago

> Same error here, right after pulling the recent Object Detection API changes. It seems that 'legacy/train.py' is not compatible with the recent changes.

So, does that mean that if I use an older version of the Object Detection API, I can train the model without any issues?

t27 commented 6 years ago

Yes @prasanth-ntu, the tutorial you linked is quite old (August 2017). If you're following that tutorial, you can switch your checkout of this repository to a commit from around that date. Also make sure to use a version of TensorFlow that is compatible with that version of the Object Detection repository. Then, yes, you can train your model according to the tutorial. The TensorFlow library and the Object Detection API have undergone quite a few changes in the past year, so using the latest version of this repository for the steps in that tutorial may not work exactly.

omrylcn commented 6 years ago

Hi, I got the same error with ssdlite_mobilenetv2.

leeseng0629 commented 6 years ago

Hi, I got the same error while using ssd_inception_v2_coco.

pomonam commented 6 years ago

I analyzed the code; the new update does not take fine_tune_checkpoint into consideration. They need to update the tutorials to reflect the recent changes.

jxnkjx commented 6 years ago

Hi, I got the same error while using ssd_resnet_v1_fpn. Can anyone help me?

pomonam commented 6 years ago

@jxnkjx Try using an older revision.

Viile1 commented 6 years ago

For example, which version? And will the TensorFlow developers fix this problem?

naisy commented 6 years ago

I succeeded in training using model_main.py with TF 1.10.1 (it probably needs TF 1.9 or later).

old: `python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config`
new: `python model_main.py --alsologtostderr --model_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config`

Reference: Running Locally https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

albarema commented 5 years ago

Why would we use model_main.py instead of train.py to retrain the model on our own dataset?

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue anymore, please consider closing it.