tensorflow / models

Models and examples built with TensorFlow

Export Object detection model (V2) fails on assertion "assert_existing_objects_matched" #8953

Open veonua opened 4 years ago

veonua commented 4 years ago

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

I was running into OOM errors with the large models, so I tried training a small dummy model with the following config:

faster_rcnn {
    num_classes: 9
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 80
        max_dimension: 100
        pad_to_max_dimension: true
      }
    }
...
train_config: {
  batch_size: 1
  num_steps: 200
....

For some reason the resulting checkpoint files are very small: ckpt-1.index is 247 bytes and ckpt-1.data-00000-of-00001 is 864 bytes.
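A quick way to confirm this symptom is to list the sizes of the checkpoint files in the training output directory. The sketch below uses only the Python standard library. The function names and the 1 MB threshold are illustrative assumptions, not part of the Object Detection API.

```python
from pathlib import Path


def checkpoint_file_sizes(model_dir):
    """Return {file name: size in bytes} for files named ckpt-* in model_dir."""
    sizes = {}
    for path in sorted(Path(model_dir).glob("ckpt-*")):
        if path.is_file():
            sizes[path.name] = path.stat().st_size
    return sizes


def suspiciously_small(sizes, min_bytes=1_000_000):
    """Return the data shards smaller than min_bytes.

    The threshold is an arbitrary assumption; a detection model's
    weights are normally far larger than a few hundred bytes.
    """
    return [name for name, size in sizes.items()
            if ".data-" in name and size < min_bytes]
```

Calling suspiciously_small(checkpoint_file_sizes("./output")) on a directory like the one described here would flag the 864-byte data shard, which suggests the checkpoint holds almost no model variables.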

Exporting this dummy model fails with the following assertion:

raise AssertionError(
    ("Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: %s")
    % (list(unused_python_objects),))

3. Steps to reproduce

Train a dummy faster_rcnn model without a fine-tune checkpoint.

4. Expected behavior

The final checkpoint file should be around 100 MB, as in v1, and the export should complete without errors.

5. Additional context

6. System information

ravikyram commented 4 years ago

@veonua

Could you share a complete code snippet or the steps to reproduce the issue in our environment? It helps us localize the issue faster. Thanks!

veonua commented 4 years ago

@ravikyram https://gist.github.com/veonua/e4186c92df80b49ad3d813f1219d0727

I'm using the latest master version of the Object Detection API, with these commands:

object_detection/model_main_tf2.py --model_dir=./output --pipeline_config_path=checkpoint/pipeline.config --num_train_steps=1000

object_detection/exporter_main_v2.py --input_type=image_tensor --trained_checkpoint_dir="./output" --output_directory="./model" --pipeline_config_path=checkpoint/pipeline.config
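The two commands can also be built from one small script, so that both steps share the same pipeline config and the exporter reads from the directory the trainer wrote to. This is only a sketch: the helper names are hypothetical, the flags are the ones shown in the commands above, and nothing is executed until a list is passed to subprocess.run.

```python
import sys


def train_command(model_dir, pipeline_config_path, num_train_steps):
    """Build the training command (flags copied from the comment above)."""
    return [
        sys.executable, "object_detection/model_main_tf2.py",
        f"--model_dir={model_dir}",
        f"--pipeline_config_path={pipeline_config_path}",
        f"--num_train_steps={num_train_steps}",
    ]


def export_command(model_dir, pipeline_config_path, output_directory):
    """Build the export command; it reads checkpoints from the training model_dir."""
    return [
        sys.executable, "object_detection/exporter_main_v2.py",
        "--input_type=image_tensor",
        f"--trained_checkpoint_dir={model_dir}",
        f"--output_directory={output_directory}",
        f"--pipeline_config_path={pipeline_config_path}",
    ]
```

Passing each list to subprocess.run(cmd, check=True) would run the step and raise on a non-zero exit code, so a failed training run is not followed by an export attempt.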

Please let me know if you need any more information.

aminzg commented 4 years ago

I get the same error with every model. Attached (TF2 Error.txt) is the terminal output with ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03 (I also tried a bunch of other models, including efficientdet_d0_coco17_tpu-32). Here's what I run:

python model_main_tf2.py \
  --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.config \
  --model_dir=training/ \
  --alsologtostderr

I get the AssertionError, followed by tons of warnings related to weight loading.

AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program:

The platform I use:

I've also tried on a brand new machine with a fresh installation, and the issue persists.

LiuXiaolong19920720 commented 4 years ago

Same Error:

AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program:

PetreanuAndi commented 4 years ago

Same error. Bump. It happens to me when loading CenterNet.

veonua commented 4 years ago

As a temporary workaround, I've removed the following call:

status.assert_existing_objects_matched()

The model seems to be working.

XiangL-Xr commented 4 years ago

Same error. It happens to me when loading CenterNet_ResNet50_v1.

claverru commented 4 years ago

Any updates on this? I'm getting the same error with every TF2 model I've tried.

legacyai commented 4 years ago

Likely a bug in TF 2.3.0

midhulavijayan commented 4 years ago

Change the line in pipeline.config

fine_tune_checkpoint_type: "classification" to fine_tune_checkpoint_type: "detection"

khu834 commented 3 years ago

> Change the line in pipeline.config
>
> fine_tune_checkpoint_type: "classification" to fine_tune_checkpoint_type: "detection"

For those who thumbed down this answer, can you provide some feedback as to why this is not the solution? I'm guessing if you run fine-tuning training again with this flag, then export the object detection model, it should work.

khu834 commented 3 years ago

I just tested this with the faster_rcnn_resnet50_v1_1024x1024_coco17_tpu-8.config model. On the first try I trained it with fine_tune_checkpoint_type: "classification"; running exporter_main_v2.py then resulted in the AssertionError above (even when I modified the config file to say "detection" for the export).

Next, I trained the model again, starting from the original pretrained model, with fine_tune_checkpoint_type: "detection". This time exporter_main_v2.py produced the saved_model correctly (using the "detection" config).
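Since the export failed after training with "classification" and succeeded after training with "detection", it can help to check which value a config file sets before starting a long training run. The sketch below is a plain-text scan, not a real protobuf parse. The function name is hypothetical, and it only recognizes the field when it starts its own line, as it does in the configs in this thread.

```python
import re

# Matches an uncommented line such as:
#   fine_tune_checkpoint_type: "detection"
_FIELD = re.compile(
    r'^[ \t]*fine_tune_checkpoint_type[ \t]*:[ \t]*["\']([^"\']*)["\']',
    re.MULTILINE,
)


def fine_tune_checkpoint_type(config_text):
    """Return the configured fine_tune_checkpoint_type, or None.

    None means the field is missing or commented out with '#'.
    """
    match = _FIELD.search(config_text)
    return match.group(1) if match else None
```

For example, reading pipeline.config into a string and passing it to fine_tune_checkpoint_type would return "classification" or "detection", or None when training from scratch with the fine-tuning lines commented out.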

anand08 commented 3 years ago

Hey @khu834, I'm also getting the same error while training the model.

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.97,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_keras'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.75,
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "mobilenet_v2/mobilenet_v2.ckpt-1"
  fine_tune_checkpoint_type: "detection"
  batch_size: 96
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 7500
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .8
          total_steps: 50000
          warmup_learning_rate: 0.13333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "kaggle_dataset/annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "kaggle_dataset/annotations/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader: {
  label_map_path: "kaggle_dataset/annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "kaggle_dataset/annotations/test.record"
  }
}

Above is the pipeline.config file used to train the model.

Error: AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program

I'm using TensorFlow 2.4.1 with the detection model ssd_mobilenet_v2_320x320_coco17_tpu-8.

Can anyone help me fix this? Thanks in advance.

khu834 commented 3 years ago

> fine_tune_checkpoint: "mobilenet_v2/mobilenet_v2.ckpt-1"

I have tested ssd_mobilenet_v2 training and export on TF 2.4.0, and it worked fine. Can you try the following?

If the fine_tune_checkpoint and your training config are both in the 'detection' format, the export should work.

JuPasquin commented 3 years ago

Hi,

I'm also having the same issue. I'm trying to train from scratch by commenting out the fine-tuning parameters, and the same error message occurs when running exporter_main_v2.py.

I'm using:

Modifications made to the original config file:

Also, if I remove status.assert_existing_objects_matched(), I'm able to save the model, but a warning shows up:

WARNING:tensorflow:Skipping full serialization of Keras layer <object_detection.meta_architectures.center_net_meta_arch.CenterNetMetaArch object at 0x7f11e6625f70>, because it is not built.
W0817 09:44:23.985459 139717209786176 save_impl.py:76] Skipping full serialization of Keras layer <object_detection.meta_architectures.center_net_meta_arch.CenterNetMetaArch object at 0x7f11e6625f70>, because it is not built.

If I try to reload the model, I can't retrieve any information from it because it is loaded as a _UserObject. Calling model.summary() gives:

AttributeError: '_UserObject' object has no attribute 'summary'

Thank you in advance.