Object detection API: Training stuck at step=0 for ssd mobilenetv2

jackyvr commented 4 years ago

System information

Didn't change the code but used my own data:
Windows 10 + conda
TensorFlow installed from binary
TensorFlow version: v1.15.0-rc3-22-g590d6eef7e 1.15.0
Python version: 3.7.6
CUDA/cuDNN version: 10.0
GPU model and memory: GeForce GTX 1080 Ti

Please provide the entire URL of the model you are using? http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_oid_v4_2018_12_12.tar.gz

Describe the current behavior The training stops at step = 0.

I wanted to do transfer learning using an SSD + MobileNetV2 model with my own images. I have only one class. The images were downloaded from the Open Images Dataset. I verified that the TFRecord was created correctly, since I can use the same data to train faster_rcnn with the Object Detection API. I created my own config file based on the one in the repo: ssd_mobilenet_v2_oid_v4.config.
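As a rough illustration of one way to sanity-check a TFRecord file without loading TensorFlow, the sketch below walks the record framing and counts records. It assumes the standard TFRecord layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) and does not verify the CRCs or decode the examples; the function name `count_tfrecord_records` is made up for this example and is not part of the Object Detection API.

```python
import struct


def count_tfrecord_records(path):
    """Count records in a TFRecord file by walking its framing.

    Assumption: standard TFRecord layout. CRC fields are read but not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            if len(f.read(4)) < 4:  # masked CRC of the length (not checked)
                raise ValueError("truncated length CRC")
            if len(f.read(length)) < length:  # serialized example bytes
                raise ValueError("truncated record payload")
            if len(f.read(4)) < 4:  # masked CRC of the payload (not checked)
                raise ValueError("truncated payload CRC")
            count += 1
    return count
```

Calling it on the training file from the config (for example `count_tfrecord_records("oid_glasses_train.tfrecord")`) should return the number of images written; a `ValueError` would suggest the file was cut off while being written.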

I also tried starting from ssd_mobilenet_v2_coco_2018_03_29.tar.gz with the corresponding config file. The behavior is the same: training gets stuck at the same place.

Describe the expected behavior I would expect to be able to train the SSD + MobileNetV2 model with my data, as I did for faster_rcnn.

Code to reproduce the issue Use images of a single class and train with a config file like the one below. Thank you!

Other info / logs

#################### ssd_mobilenet_v2_oid_v4.config.

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0008
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  gradient_clipping_by_norm: 10.0
  keep_checkpoint_every_n_hours: 24
  fine_tune_checkpoint: "D:/work/cv/others/my-tf2-od-transfer-ssd-mobilenet-v2/ssd_mobilenet_v2_oid_v4_2018_12_12/model.ckpt"
  num_steps: 100
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "D:/work/cv/others/my-tf2-od-transfer-ssd-mobilenet-v2/data/oid_glasses_train.tfrecord"
  }
  label_map_path: "D:/work/cv/others/my-tf2-od-transfer-ssd-mobilenet-v2/labelmap.pbtxt"
}

eval_config: {
  metrics_set: "open_images_V2_detection_metrics"
}

eval_input_reader: {
  sample_1_of_n_examples: 10
  tf_record_input_reader {
    input_path: "D:/work/cv/others/my-tf2-od-transfer-ssd-mobilenet-v2/data/oid_glasses_validation.tfrecord"
  }
  label_map_path: "D:/work/cv/others/my-tf2-od-transfer-ssd-mobilenet-v2/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}

#################### CONSOLE LOG:
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0416 16:30:39.198738 19792 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0416 16:30:39.632495 19792 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
I0416 16:30:48.724722 19792 basic_session_run_hooks.py:606] Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
2020-04-16 16:30:59.919297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 16:31:00.964680: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 16:31:00.986098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 12.512502, step = 0
I0416 16:31:02.740392 19792 basic_session_run_hooks.py:262] loss = 12.512502, step = 0
[STUCK HERE]
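For anyone comparing logs, a small sketch like the following can pull the most recent `loss = ..., step = ...` pair out of captured console output, which makes it easy to see whether the step counter ever moves past 0. The regular expression is based only on the two log lines shown above, and `last_logged_step` is a hypothetical helper name, not a TensorFlow function.

```python
import re

# Matches the "loss = 12.512502, step = 0" fragment seen in the log above.
STEP_RE = re.compile(r"loss = ([^,\s]+), step = (\d+)")


def last_logged_step(log_text):
    """Return (step, loss) for the last matching log line, or None if none match."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    loss, step = matches[-1]
    return int(step), float(loss)
```

If repeated calls on a growing log keep returning step 0, training has not advanced since the first logged step.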

proxip commented 4 years ago

I had the same experience. I just reinstalled my platform (i.e., reinstalled TensorFlow), and then it worked.

jackyvr commented 4 years ago

@proxip, thanks for your reply. Are you also using conda on Windows? I will try reinstalling.

jackyvr commented 4 years ago

@proxip, after reinstalling my platform I still get the same problem. Could you please look at my config file above and see whether you can spot any problem? Thanks!

I verified that the time stamp of the event file events.out.tfevents.1587425406.MY-COMPUTER does not update.
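The timestamp check can also be scripted. The sketch below polls a file's modification time and reports whether it stayed unchanged for a given period; the function name and the default intervals are arbitrary choices for illustration, and an unchanged timestamp is only a hint that training is stalled, not proof.

```python
import os
import time


def file_is_stalled(path, wait_seconds=60.0, poll_seconds=5.0):
    """Return True if the mtime of `path` does not change within `wait_seconds`."""
    start_mtime = os.path.getmtime(path)
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        time.sleep(poll_seconds)
        if os.path.getmtime(path) != start_mtime:
            return False  # the file was written to, so something is progressing
    return True
```

For example, `file_is_stalled("events.out.tfevents.1587425406.MY-COMPUTER", wait_seconds=300)` would watch the event file mentioned above for five minutes.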

jackyvr commented 4 years ago

I found out that the combination of the TF 1.15 GPU build and my setup causes the problem; the relevant warning in the log is "Invoking ptxas not supported on Windows". Downgrading to the TF 1.14 GPU build or using the TF 1.15 CPU build solves the issue. It is a common and still open issue in TensorFlow: HERE
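A quick way to test the CPU path without reinstalling anything might be to hide the GPU from TensorFlow before it is imported. This is a sketch under assumptions and was not tried in this thread: it relies on the `CUDA_VISIBLE_DEVICES` environment variable (setting it to `-1` hides all CUDA devices), and the helper names and the Windows + 1.15 condition are hypothetical, based only on the setup reported above.

```python
import os


def should_force_cpu(tf_version, system):
    """True for the combination reported as problematic here: Windows + TF 1.15."""
    major, minor = tf_version.split(".")[:2]
    return system == "Windows" and (int(major), int(minor)) == (1, 15)


def force_cpu(environ=os.environ):
    """Hide all CUDA devices; must run before TensorFlow is imported."""
    environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

In practice one would call `force_cpu()` at the very top of the training script (or set the variable in the shell) when `should_force_cpu("1.15.0", platform.system())` is true. If training then advances past step 0, that points to the GPU build rather than the config.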

tensorflow / models

Object detection API: Training stuck at step=0 for ssd mobilenetv2 #8404