Closed: acidassassin closed this issue 4 years ago
@acidassassin, I see you closed this ticket on the same day you opened it. I am having the same problem. Did you solve it, and if so, what was your solution?
@haltersweb, try deleting the fine_tune_checkpoint_version line from the configuration file. That worked for me.
@haltersweb same problem. Did you solve it?
Yes, just delete the fine_tune_checkpoint_version line (line 172) from your pipeline config file.
When I delete fine_tune_checkpoint_version: V2 on line 172, I get a ValueError when training:
ValueError: Checkpoint version should be V2
So deleting doesn't work for me. Has anyone solved this?
I am facing the same issue. Can someone help quickly?
Also facing the same issue
Deleting the line fine_tune_checkpoint_version: V2
doesn't actually work at training time.
What worked for me instead was editing object_detection/utils/config_util.py: inside the get_configs_from_pipeline_file(pipeline_config_path, config_override=None) function, replace line 137:
with tf.gfile.GFile(pipeline_config_path, "r") as f:
with:
with tf.io.gfile.GFile(pipeline_config_path, "r") as f:
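For context, this is roughly what that part of get_configs_from_pipeline_file boils down to after the change. It is a minimal sketch, not the actual function: the helper name read_pipeline_config is made up here, and it assumes TensorFlow 2.x and an installed object_detection package.

import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

def read_pipeline_config(pipeline_config_path):
    # Parse a pipeline.config file into a TrainEvalPipelineConfig proto.
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    # tf.io.gfile is the TF2 file API; plain tf.gfile no longer exists in TF 2.x.
    with tf.io.gfile.GFile(pipeline_config_path, "r") as f:
        proto_str = f.read()
    text_format.Merge(proto_str, pipeline_config)
    return pipeline_config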
Commenting out the fine_tune_checkpoint_version line (line 172) in the pipeline config file worked for me.
> When I delete fine_tune_checkpoint_version: V2 on line 172, I get a ValueError when training: "ValueError: Checkpoint version should be V2". So deleting doesn't work for me. Has anyone solved this?

Yeah, I get this error too. I am using Python 3.8 with TensorFlow 2.13 and Object Detection API version 0.1.1. Has anyone got a solution?
> Commenting out the fine_tune_checkpoint_version line (line 172) in the pipeline config file worked for me.

How do you comment it out?
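For reference: the pipeline config is protobuf text format, which treats everything after a # as a comment. So commenting the line out just means prefixing it with #; a sketch of the relevant train_config lines (other fields left unchanged):

train_config {
  # ... other train_config fields unchanged ...
  fine_tune_checkpoint_type: "classification"
  # fine_tune_checkpoint_version: V2   <- prefix the line with '#' instead of deleting it
}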
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/...
2. Describe the bug
So I am trying to train a new model based on the SSD MobileNet V2 FPNLite 320x320 checkpoint with the help of the GCP AI Platform. I am using TPUs for it, as described on this page: link
I am getting the following error:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/root/.local/lib/python3.7/site-packages/object_detection/model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/root/.local/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 470, in train_loop
    pipeline_config_path, config_override=config_override)
  File "/root/.local/lib/python3.7/site-packages/object_detection/utils/config_util.py", line 139, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 734, in Merge
    allow_unknown_field=allow_unknown_field)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 802, in MergeLines
    return parser.MergeLines(lines, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 827, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 849, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/text_format.py", line 941, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 172:3 : Message type "object_detection.protos.TrainConfig" has no field named "fine_tune_checkpoint_version".
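Read literally, the ParseError says the TrainConfig proto installed in the training environment does not define fine_tune_checkpoint_version. A quick way to check what the installed package knows about is the following sketch (run it wherever the same version of the object_detection package is installed):

from object_detection.protos import train_pb2

# Prints True if the installed TrainConfig proto defines the field,
# False if the installed protos predate it.
print("fine_tune_checkpoint_version" in train_pb2.TrainConfig.DESCRIPTOR.fields_by_name)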
3. Steps to reproduce
Follow the link mentioned above and try it with the SSD MobileNet V2 FPNLite 320x320 checkpoints. My pipeline config looks like this:
model {
  ssd {
    num_classes: 4
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_fpn_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          random_normal_initializer {
            mean: 0.0
            stddev: 0.009999999776482582
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.996999979019165
          scale: true
          epsilon: 0.0010000000474974513
        }
      }
      use_depthwise: true
      override_base_feature_extractor_hyperparams: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.996999979019165
            scale: true
            epsilon: 0.0010000000474974513
          }
        }
        depth: 128
        num_layers_before_predictor: 4
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        share_prediction_tower: true
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 2
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 128
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 50000
          warmup_learning_rate: 0.026666000485420227
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "gs://tom-master-od-bucket/models/cocossdoid_output/checkpoint/ckpt-0"
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "classification"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "gs://tom-master-od-bucket/data/label_bbox.pbtxt"
  tf_record_input_reader {
    input_path: "gs://tom-master-od-bucket/data/train.tfrecord"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "gs://tom-master-od-bucket/data/label_bbox.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "gs://tom-master-od-bucket/data/validation.tfrecord"
  }
}
I call the gcloud command like this:
gcloud ai-platform jobs submit training `whoami`_object_detection_`date +%m%d%Y%H%M_%S` \
  --job-dir=gs://${MODEL_DIR} \
  --package-path=./object_detection \
  --module-name=object_detection.model_main_tf2 \
  --runtime-version=2.2 \
  --python-version=3.7 \
  --scale-tier=BASIC_TPU \
  --region=us-central1 \
  -- \
  --distribution_strategy=tpu \
  --model_dir=gs://${MODEL_DIR} \
  --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
4. Expected behavior
The training starts.
Does anyone know what the problem is?
Thanks and best regards, Tom