tensorflow / models

Models and examples built with TensorFlow

Training SSD-MobilenetV2 fails with Message type "object_detection.protos.TrainConfig" has no field named "fine_tune_checkpoint_version" #9297

Closed: acidassassin closed this issue 4 years ago

acidassassin commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/...

2. Describe the bug

I am trying to train a new model based on the SSD MobileNet V2 FPNLite 320x320 checkpoints using GCP AI Platform. I am using TPUs, as described on this page: link

I am getting the following error:

The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.local/lib/python3.7/site-packages/object_detection/model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/root/.local/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 470, in train_loop
    pipeline_config_path, config_override=config_override)
  File "/root/.local/lib/python3.7/site-packages/object_detection/utils/config_util.py", line 139, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 734, in Merge
    allow_unknown_field=allow_unknown_field)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 802, in MergeLines
    return parser.MergeLines(lines, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 827, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 849, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 941, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 172:3 : Message type "object_detection.protos.TrainConfig" has no field named "fine_tune_checkpoint_version".
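The position at the start of the ParseError message (here 172:3) is the line and column of the rejected field in the pipeline config, so it can be read off and matched against the file. Below is a minimal sketch using only the Python standard library. The helper names locate_unknown_field and offending_line are hypothetical and are not part of the Object Detection API or protobuf; the regular expression assumes the message has exactly the shape shown in the traceback above.

```python
import re

# Matches messages shaped like:
#   172:3 : Message type "pkg.Type" has no field named "some_field".
# (Assumption: the wording is the same as in the traceback in this issue.)
_PARSE_ERROR = re.compile(
    r'(?P<line>\d+):(?P<col>\d+) : Message type "(?P<type>[^"]+)" '
    r'has no field named "(?P<field>[^"]+)"'
)


def locate_unknown_field(error_text):
    """Return (line, column, message_type, field), or None if no match."""
    match = _PARSE_ERROR.search(error_text)
    if match is None:
        return None
    return (
        int(match["line"]),
        int(match["col"]),
        match["type"],
        match["field"],
    )


def offending_line(config_text, line_number):
    """Return the 1-indexed line of config_text, or None if out of range."""
    lines = config_text.splitlines()
    if 1 <= line_number <= len(lines):
        return lines[line_number - 1]
    return None
```

Passing the last line of the traceback to locate_unknown_field and the resulting line number to offending_line would show the exact config line the installed TrainConfig definition does not recognize.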

3. Steps to reproduce

Follow the mentioned link and try it with the SSD MobileNet V2 FPNLite 320x320 checkpoints. My pipeline config looks like this:

model {
  ssd {
    num_classes: 4
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_fpn_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          random_normal_initializer {
            mean: 0.0
            stddev: 0.009999999776482582
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019165
          scale: true
          epsilon: 0.0010000000474974513
        }
      }
      use_depthwise: true
      override_base_feature_extractor_hyperparams: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019165
            scale: true
            epsilon: 0.0010000000474974513
          }
        }
        depth: 128
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        share_prediction_tower: true
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 128
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 50000
          warmup_learning_rate: 0.026666000485420227
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "gs://tom-master-od-bucket/models/cocossdoid_output/checkpoint/ckpt-0"
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "classification"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "gs://tom-master-od-bucket/data/label_bbox.pbtxt"
  tf_record_input_reader {
    input_path: "gs://tom-master-od-bucket/data/train.tfrecord"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "gs://tom-master-od-bucket/data/label_bbox.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "gs://tom-master-od-bucket/data/validation.tfrecord"
  }
}

I call the gcloud command like this:

gcloud ai-platform jobs submit training `whoami`_object_detection_`date +%m%d%Y%H%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --package-path=./object_detection \
    --module-name=object_detection.model_main_tf2 \
    --runtime-version=2.2 \
    --python-version=3.7 \
    --scale-tier=BASIC_TPU \
    --region=us-central1 \
    -- \
    --distribution_strategy=tpu \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

4. Expected behavior

The training starts.

Does anyone know what the problem is?

Thanks and best regards, Tom

haltersweb commented 3 years ago

@acidassassin, I see you closed this ticket on the same day you opened it. I am having the same problem. Did you solve the problem? What was your solution?

RafaelSCoelho commented 3 years ago

@haltersweb, try deleting the fine_tune_checkpoint_version line from the configuration file. That worked fine for me.

Mark8282 commented 3 years ago

@haltersweb same problem. Did you solve it?

priiyaanjaalii0611 commented 2 years ago

> @haltersweb same problem. Did you solve it?

Yes, just delete the fine_tune_checkpoint_version line (line 172) from your pipeline config file.

DmblnNicole commented 2 years ago

When I delete "fine_tune_checkpoint_version: V2" on line 172, I get a ValueError when training:

ValueError: Checkpoint version should be V2

So deleting doesn't work for me. Has anyone solved this?

Aman7Rathore commented 2 years ago

> When I delete "fine_tune_checkpoint_version: V2" on line 172, I get a ValueError when training:
>
> ValueError: Checkpoint version should be V2
>
> So deleting doesn't work for me. Has anyone solved this?

Aman7Rathore commented 2 years ago

I am facing the same issue. Can someone help quickly?

maltelandgren commented 2 years ago

Also facing the same issue

DoppiaEffe94 commented 2 years ago

In fact, deleting the line fine_tune_checkpoint_version: V2 doesn't work when training.

What worked for me instead was the following.

In object_detection/utils/config_util.py, inside the get_configs_from_pipeline_file(pipeline_config_path, config_override=None) function, replace line 137:

with tf.gfile.GFile(pipeline_config_path, "r") as f:

with:

with tf.io.gfile.GFile(pipeline_config_path, "r") as f:
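The change above swaps the older tf.gfile.GFile name for tf.io.gfile.GFile. Instead of editing the call by hand, a small helper could pick whichever of the two the installed TensorFlow exposes. This is only a sketch: resolve_gfile is a hypothetical helper, not part of TensorFlow or the Object Detection API, and it is shown here against stand-in objects so that it runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_gfile(tf_module):
    """Return GFile from tf.io.gfile if present, otherwise from tf.gfile."""
    io_namespace = getattr(tf_module, "io", None)
    gfile_namespace = getattr(io_namespace, "gfile", None)
    if gfile_namespace is not None and hasattr(gfile_namespace, "GFile"):
        return gfile_namespace.GFile
    # Fall back to the older top-level name.
    return tf_module.gfile.GFile


# Stand-ins for the two module layouts (assumption: these are NOT real
# TensorFlow modules, just objects with the same attribute paths).
class NewStyleGFile:
    pass


class OldStyleGFile:
    pass


tf2_like = SimpleNamespace(
    io=SimpleNamespace(gfile=SimpleNamespace(GFile=NewStyleGFile))
)
tf1_like = SimpleNamespace(gfile=SimpleNamespace(GFile=OldStyleGFile))

# With a real install this would be used as:
#   import tensorflow as tf
#   with resolve_gfile(tf)(pipeline_config_path, "r") as f:
#       proto_str = f.read()
```

Whether this change also resolves the "has no field named" error is not established in this thread; it only addresses the file-reading call on line 137.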

ChamithDilshan commented 1 year ago

Commenting out the fine_tune_checkpoint_version line (line 172) in the pipeline config file worked for me.

ankushkumarpatiyal commented 1 year ago

> When I delete "fine_tune_checkpoint_version: V2" on line 172, I get a ValueError when training:
>
> ValueError: Checkpoint version should be V2
>
> So deleting doesn't work for me. Has anyone solved this?

Yes, I get this error too. I am using Python 3.8 with TensorFlow 2.13 and Object Detection API version 0.1.1. Has anyone found a solution?

salsafir commented 5 months ago

> Commenting out the fine_tune_checkpoint_version line (line 172) in the pipeline config file worked for me.

How do you comment it out?
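On the question of how to comment a line out: in protobuf text format, everything from a # to the end of the line is a comment, so putting # in front of the field is enough. A minimal sketch in Python follows. The helper name comment_out_field is hypothetical, and it assumes the config has one field per line. Note that some people in this thread report a ValueError after removing the line, so this may not fix every setup.

```python
import re


def comment_out_field(config_text, field_name):
    """Prefix each line that sets `field_name` with '# '.

    Assumes one field per line, as in a normally formatted pipeline config.
    Lines that are already commented out are left unchanged.
    """
    pattern = re.compile(r"^(\s*)(" + re.escape(field_name) + r"\s*:.*)$")
    result = []
    for line in config_text.splitlines(keepends=True):
        # Keep the original indentation and insert '# ' before the field.
        result.append(pattern.sub(r"\1# \2", line))
    return "".join(result)


# Example on a shortened, made-up train_config block.
example = (
    "train_config {\n"
    '  fine_tune_checkpoint_type: "classification"\n'
    "  fine_tune_checkpoint_version: V2\n"
    "}\n"
)
patched = comment_out_field(example, "fine_tune_checkpoint_version")
```

After this, the third line of the example reads "  # fine_tune_checkpoint_version: V2" and the text-format parser skips it.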