tensorflow / models

Models and examples built with TensorFlow

Issue while retraining ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03 #10107

Open floflif opened 3 years ago

floflif commented 3 years ago

Hello, sorry for the inconvenience, but I currently have the same issue. I'm using TensorFlow 2.5.0 with the matching CUDA version. My model is ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03, so I modified the associated config file, ssd_mobilenet_v2_quantized_300x300_coco.config (from https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config).

I set the checkpoint path for my setup:

fine_tune_checkpoint: "C:/tensorflow1/models/research/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03/model.ckpt"

I also have my own labelmap.pbtxt, train.record and test.record. The model folder contains the following files:

model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
pipeline.config
tflite_graph.pb
tflite_graph.pbtxt

So I also modified the required path in the "pipeline.config" file. I have been investigating since yesterday and searched online, but unfortunately I did not find anything that solves my error.

I also changed line 83 from type: 'ssd_mobilenet_v2' to type: 'ssd_mobilenet_v2_keras', because the default config file also gave me an error on that line.

When I launch the following command:

python model_main_tf2.py --pipeline_config_path=training/ssd_mobilenet_v2_quantized_300x300_coco.config --model_dir=training --alsologtostderr

I get the same error as at the top of this topic:

Traceback (most recent call last):
  File "model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\Flo\mob1\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\Flo\mob1\lib\site-packages\absl\app.py", line 312, in run
    _run_main(main, args)
  File "C:\Users\Flo\mob1\lib\site-packages\absl\app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 603, in train_loop
    train_input, unpad_groundtruth_tensors)
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 389, in load_fine_tune_checkpoint
    raise IOError('Checkpoint is expected to be an object-based checkpoint.')
OSError: Checkpoint is expected to be an object-based checkpoint.

My entire config file is here:

# Quantized trained SSD with Mobilenet v2 on MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 15
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_keras'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "C:/tensorflow1/models/research/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03/model.ckpt"
  fine_tune_checkpoint_type:  "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "C:/tensorflow1/models/research/object_detection/train.record"
  }
  label_map_path: "C:/tensorflow1/models/research/object_detection/training/labelmap.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "C:/tensorflow1/models/research/object_detection/test.record"
  }
  label_map_path: "C:/tensorflow1/models/research/object_detection/training/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}

graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}

Originally posted by @Drisnor in https://github.com/tensorflow/models/issues/9278#issuecomment-871459469

Johannes-vaki commented 3 years ago

I'm not sure, but you can try setting fine_tune_checkpoint_version: V2 (the default is V1) in the train_config, since you are using v2_keras. I recommend looking into the protos when you have trouble with the pipeline.config file.

floflif commented 3 years ago

Hello, thanks for your answer, but I checked and there is no fine_tune_checkpoint_version attribute anywhere, neither in pipeline.config nor in ssd_mobilenet_v2_quantized_300x300_coco.config.

Johannes-vaki commented 3 years ago

You have to add it. It is part of the proto's definition. https://github.com/tensorflow/models/blob/master/research/object_detection/protos/train.proto#L68

floflif commented 3 years ago

Hello, I now have the following in the config file:

fine_tune_checkpoint: "C:/tensorflow1/models/research/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03/model.ckpt"
fine_tune_checkpoint_type: "detection"
fine_tune_checkpoint_version: V2

I added the field you mentioned, but training still fails with the same error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\Flo\mob1\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\Flo\mob1\lib\site-packages\absl\app.py", line 312, in run
    _run_main(main, args)
  File "C:\Users\Flo\mob1\lib\site-packages\absl\app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 603, in train_loop
    train_input, unpad_groundtruth_tensors)
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 389, in load_fine_tune_checkpoint
    raise IOError('Checkpoint is expected to be an object-based checkpoint.')
OSError: Checkpoint is expected to be an object-based checkpoint.

satojkovic commented 3 years ago

Hi @Drisnor, if you use a model that is compatible with TF v2, the error will be resolved. load_fine_tune_checkpoint calls is_object_based_checkpoint to inspect the contents of the checkpoint, and that check returns False for the v1 model:

In [1]: import tensorflow.compat.v1 as tf
In [2]: var_names = [var[0] for var in tf.train.list_variables('ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03/model.ckpt')]
In [3]: '_CHECKPOINTABLE_OBJECT_GRAPH' in var_names
Out[3]: False

In the case of a v2 model:

In [4]: var_names = [var[0] for var in tf.train.list_variables('ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0')]
In [5]: '_CHECKPOINTABLE_OBJECT_GRAPH' in var_names
Out[5]: True

I hope this is helpful.
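
The check shown in the session above can be wrapped in a small helper so that a checkpoint is tested before it is put into fine_tune_checkpoint. This is only a minimal sketch: the names has_object_graph and checkpoint_is_object_based are hypothetical (they are not part of the Object Detection API), the marker name is copied from the session above, and the second function assumes TensorFlow is installed.

```python
# Marker variable name, taken from the interactive session above.
OBJECT_GRAPH_KEY = "_CHECKPOINTABLE_OBJECT_GRAPH"


def has_object_graph(var_names):
    """Return True if the variable names include the object-graph marker."""
    return OBJECT_GRAPH_KEY in set(var_names)


def checkpoint_is_object_based(checkpoint_prefix):
    """Check a checkpoint on disk (hypothetical helper, needs TensorFlow).

    `checkpoint_prefix` is the path without the .index/.data suffix,
    for example ".../model.ckpt" or ".../checkpoint/ckpt-0".
    """
    # Imported lazily so has_object_graph stays usable without TensorFlow.
    import tensorflow as tf

    # tf.train.list_variables returns (name, shape) pairs.
    names = [name for name, _shape in tf.train.list_variables(checkpoint_prefix)]
    return has_object_graph(names)
```

Based on the outputs above, calling checkpoint_is_object_based on the ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03 checkpoint should give False and on the ssd_mobilenet_v2_320x320_coco17_tpu-8 checkpoint True, so only the latter would pass the check in load_fine_tune_checkpoint.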

deepali0162 commented 2 years ago

@satojkovic I tried your recommendations but I am still facing the same error as reported by @Drisnor. I have been struggling with this code for 2 days; could someone please help with the error below?

Traceback (most recent call last):
  File "model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 685, in train_loop
    losses_dict = _dist_train_step(train_input_iter)
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/deepali/Documents/CV_Projects/Decarb_ObjectDetection/models/venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Unknown image file format. One of JPEG, PNG, GIF, BMP required.
         [[{{node case/cond/cond_jpeg/decode_image/DecodeImage}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[while/body/_1/IteratorGetNext]] [Op:__inference__dist_train_step_89441]

Function call stack:
_dist_train_step -> _dist_train_step

satojkovic commented 2 years ago

@deepali0162 How did you create the TFRecord file? Looking at the traceback, it looks like an image stored in the TFRecord is not in a supported format. I'd recommend checking the sanity of your data.
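
One way to run such a sanity check is to look at the first bytes of every encoded image and flag anything that is not JPEG, PNG, GIF or BMP. The sketch below is a heuristic, not the decoder TensorFlow uses: sniff_image_format and find_bad_records are hypothetical names, and it assumes each record is a tf.train.Example with the image bytes under 'image/encoded' and the file name under 'image/filename'. Adjust the keys if your TFRecord was written differently.

```python
def sniff_image_format(data):
    """Return 'jpeg', 'png', 'gif' or 'bmp' based on magic bytes, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith(b"BM"):
        return "bmp"
    return None


def find_bad_records(tfrecord_path, image_key="image/encoded"):
    """Yield (index, filename) for records whose image bytes are unrecognized.

    Hypothetical helper; needs TensorFlow 2.x with eager execution.
    The feature keys are assumptions about how the TFRecord was written.
    """
    # Imported lazily so sniff_image_format works without TensorFlow.
    import tensorflow as tf

    for index, raw in enumerate(tf.data.TFRecordDataset(tfrecord_path)):
        example = tf.train.Example.FromString(raw.numpy())
        feature = example.features.feature
        values = feature[image_key].bytes_list.value
        data = values[0] if values else b""
        if sniff_image_format(data) is None:
            names = feature["image/filename"].bytes_list.value
            name = names[0].decode("utf-8", "replace") if names else ""
            yield index, name
```

Running list(find_bad_records("train.record")) would list the suspicious records. A file with a valid header can still be truncated or corrupt, so a clean result here does not guarantee that every image decodes.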

deepali0162 commented 2 years ago

@satojkovic I just figured out that one of the images was causing the issue. Thank you so much for your reply.