tensorflow / models

Models and examples built with TensorFlow

Fails to start CenterNet HourGlass104 1024x1024 training process #9729

Open ZhongHouyu opened 3 years ago

ZhongHouyu commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

http://download.tensorflow.org/models/object_detection/tf2/20200711/centernet_hg104_1024x1024_kpts_coco17_tpu-32.tar.gz

2. Describe the bug

While I was training CenterNet HourGlass104 1024x1024 on my own dataset, following the [tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/), the process stopped after printing many warnings.

3. Steps to reproduce

I started with the folder structure recommended in the TensorFlow 2 Object Detection API tutorial:

training_demo/
├─ annotations/
├─ exported-models/
├─ images/
│  ├─ test/
│  └─ train/
├─ models/
├─ pre-trained-models/
└─ README.md

I downloaded and unzipped the CenterNet archive into pre-trained-models, got everything ready, and then started training.

I created a new folder, my_centernet, in models, and set pipeline.config as follows:

model {
  center_net {
    num_classes: 2
    feature_extractor {
      type: "hourglass_104"
      channel_means: 104.01361846923828
      channel_means: 114.03422546386719
      channel_means: 119.91659545898438
      channel_stds: 73.60276794433594
      channel_stds: 69.89082336425781
      channel_stds: 70.91507720947266
      bgr_ordering: true
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 1024
        max_dimension: 1024
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.10000000149011612
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
      min_box_overlap_iou: 0.699999988079071
      max_box_predictions: 100
    }
  }
}
train_config {
  batch_size: 8
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_adjust_hue {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    random_adjust_saturation {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  data_augmentation_options {
    random_square_crop_by_scale {
      scale_min: 0.6000000238418579
      scale_max: 1.2999999523162842
    }
  }
  optimizer {
    adam_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.0010000000474974513
          total_steps: 50000
          warmup_learning_rate: 0.0002500000118743628
          warmup_steps: 5000
        }
      }
      epsilon: 1.0000000116860974e-07
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-models/centernet_hg104_1024x1024_coco17_tpu-32/checkpoint/ckpt-0"
  num_steps: 50000
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}

I used the command as follows:

python model_main_tf2.py --model_dir=models/my_centernet --pipeline_config_path=models/my_centernet/pipeline.config
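Before launching, it can help to confirm that every path named in pipeline.config resolves from the directory where the command is run. The following is a minimal sketch in plain Python, assuming the config is read as text; `find_missing_paths` is a hypothetical helper, not part of the Object Detection API. It checks the three path fields that appear in the config above. Note that the archive name in step 1 contains `kpts`, while the `fine_tune_checkpoint` path does not, so the checkpoint path in particular may be worth checking.

```python
import os
import re

# Fields in pipeline.config whose values are file paths (taken from the config above).
PATH_FIELDS = ("fine_tune_checkpoint", "label_map_path", "input_path")


def find_missing_paths(config_text, base_dir="."):
    """Return (field, path) pairs from the config that do not exist on disk.

    Hypothetical helper for illustration only.
    """
    missing = []
    for field in PATH_FIELDS:
        # Matches lines such as: label_map_path: "annotations/label_map.pbtxt"
        # The trailing colon keeps fine_tune_checkpoint_type from matching.
        for value in re.findall(rf'\b{field}:\s*"([^"]+)"', config_text):
            path = os.path.join(base_dir, value)
            if field == "fine_tune_checkpoint":
                # A TF2 checkpoint prefix such as ".../ckpt-0" is stored on disk
                # as ".../ckpt-0.index" plus one or more data shards.
                path += ".index"
            if not os.path.exists(path):
                missing.append((field, value))
    return missing


if __name__ == "__main__":
    with open("models/my_centernet/pipeline.config") as f:
        for field, value in find_missing_paths(f.read()):
            print(f"missing {field}: {value}")
```

Run from training_demo/, the script prints one line per path that cannot be found and prints nothing if all paths resolve.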

4. Expected behavior

Training was expected to run to completion on my own dataset, with the results saved in models.

The console log should have looked like this:

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
W0716 05:24:19.105542  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
W0716 05:24:19.106541  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
W0716 05:24:19.107540  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
W0716 05:24:19.108539  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0716 05:24:19.108539  1364 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
INFO:tensorflow:Step 100 per-step time 1.153s loss=0.761
I0716 05:26:55.879558  1364 model_lib_v2.py:632] Step 100 per-step time 1.153s loss=0.761

5. Additional context

The process stopped after plenty of warnings but no errors.

Part of the log was as follows:

W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.gamma
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.beta
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_mean
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_variance
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.conv.kernel
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.conv.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.axis
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.gamma
W0205 14:33:37.256316 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.beta
W0205 14:33:37.257313 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_mean
W0205 14:33:37.257313 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_mean
W0205 14:33:37.257313 10924 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0205 14:33:37.257313 10924 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

It stopped after printing the last W0205 14:33:37.257313 line shown above.
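Since the log is dominated by repeated "Unresolved object in checkpoint" lines, a short summary makes it easier to see which parts of the model they refer to. This is a minimal sketch in plain Python that only parses the log text; `summarize_unresolved` is a hypothetical helper, and the log file name in the usage line is an assumption.

```python
import re
from collections import Counter

# Captures the object path in lines such as:
#   WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor...
UNRESOLVED = re.compile(r"Unresolved object in checkpoint: \(root\)\.(\S+)")


def summarize_unresolved(log_lines, depth=2):
    """Count unresolved checkpoint objects, grouped by their first `depth` path components."""
    seen = set()
    counts = Counter()
    for line in log_lines:
        match = UNRESOLVED.search(line)
        if not match:
            continue
        path = match.group(1)
        if path in seen:
            # Most warnings are logged twice (a WARNING:tensorflow line and a W0205 ... util.py line).
            continue
        seen.add(path)
        counts[".".join(path.split(".")[:depth])] += 1
    return counts


if __name__ == "__main__":
    # "train.log" is an assumed file holding the captured console output.
    with open("train.log") as f:
        for prefix, n in summarize_unresolved(f).most_common():
            print(f"{n:6d}  {prefix}")
```

For the excerpt above, every unresolved object falls under `model._feature_extractor`.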

6. System information

omar16100 commented 3 years ago

+1

vighneshbirodkar commented 3 years ago

Hi @ZhongHouyu

Sometimes training can take a long time to start due to TF compilation. I would recommend just waiting for 30 mins or so.

Also, you should set fine_tune_checkpoint_type: "fine_tune". The "detection" checkpoint type should only be used for the ExtremeNet checkpoint from the model zoo.
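As a rough illustration of that change, the field can be edited by hand in pipeline.config or rewritten with a few lines of Python. This is only a sketch that edits the config as plain text; `set_checkpoint_type` is a hypothetical helper, not an Object Detection API function.

```python
import re


def set_checkpoint_type(config_text, new_type="fine_tune"):
    """Return config_text with fine_tune_checkpoint_type set to new_type.

    Hypothetical helper that edits the text directly, for illustration only.
    """
    pattern = r'(fine_tune_checkpoint_type:\s*)"[^"]*"'
    updated, count = re.subn(pattern, rf'\g<1>"{new_type}"', config_text)
    if count != 1:
        raise ValueError(
            f"expected exactly one fine_tune_checkpoint_type field, found {count}"
        )
    return updated


if __name__ == "__main__":
    path = "models/my_centernet/pipeline.config"
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(set_checkpoint_type(text))
```

After the rewrite, the relevant line in train_config reads fine_tune_checkpoint_type: "fine_tune", and the other fine_tune_checkpoint fields are left unchanged.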

kulkarnivishal commented 2 years ago

@vighneshbirodkar what do you mean by "The detection checkpoint type should only be used for the ExtremeNet (model) checkpoint from the model zoo"? This is how it's documented in the code:

  1. "classification": Restores only the classification backbone part of the feature extractor. This option is typically used when you want to train a detection model starting from a pre-trained image classification model, e.g. a ResNet model pre-trained on ImageNet
  2. "detection": Restores the entire feature extractor. The only parts of the full detection model that are not restored are the box and class prediction heads. This option is typically used when you want to use a pre-trained detection model and train on a new dataset or task which requires different box and class prediction heads.
  3. "full": Restores the entire detection model, including the feature extractor, its classification backbone, and the prediction heads. This option should only be used when the pre-training and fine-tuning tasks are the same. Otherwise, the model's parameters may have incompatible shapes, which will cause errors when attempting to restore the checkpoint.