tensorflow / models

Models and examples built with TensorFlow

Object Detection API 2.0, error with load checkpoints: A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. #8892

Closed DongChen06 closed 4 years ago

DongChen06 commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

Thanks for releasing the Object Detection API 2.0. I am trying to train the model on my own dataset. I downloaded the trained CenterNet HourGlass104 512x512 model from the model zoo, then changed the config file and ran the code. The following error appears.

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_variance
W0716 19:56:53.424076 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.conv_block.norm.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.conv.kernel
W0716 19:56:53.424108 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.conv.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.axis
W0716 19:56:53.424140 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.gamma
W0716 19:56:53.424172 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.beta
W0716 19:56:53.424204 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_mean
W0716 19:56:53.424236 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_variance
W0716 19:56:53.424268 140587994642240 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor._network.hourglass_network.1.inner_block.0.inner_block.0.inner_block.0.inner_block.0.decoder_block.1.skip.norm.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0716 19:56:53.424301 140587994642240 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

I do not know how to resolve this issue!

6. System information

davide-scalzo commented 4 years ago

I'm having the exact same issue.

Folder structure

data/
├── labels.pbtxt
├── train.record
├── test.record

model/ #extracted from http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d0_coco17_tpu-32.tar.gz
├── saved_model/
    ├── assets/
    ├── variables/
    ├── saved_model.pb
├── checkpoint/
    ├── checkpoint
    ├── ckpt-0.data-00000-of-00001
    ├── ckpt-0.index
├── pipeline.config
train.py # copy of model_main_tf2.py

Command run: python train.py --alsologtostderr --model_dir=model/ --pipeline_config_path=model/pipeline.config

Config

...
  fine_tune_checkpoint: "model/ckpt-0"
  num_steps: 300000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  use_bfloat16: true
  fine_tune_checkpoint_version: V2
}
train_input_reader: {
  label_map_path: "data/labels.pbtxt"
  tf_record_input_reader {
    input_path: "data/train.tfrecord"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "data/labels.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "data/test.tfrecord"
  }
}

It does create a train folder under model/ but fails with the following output before any learning happens:

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.beta
W0717 10:59:19.085086 140059259959104 util.py:144] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_mean
W0717 10:59:19.085122 140059259959104 util.py:144] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
W0717 10:59:19.085184 140059259959104 util.py:144] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0717 10:59:19.085304 140059259959104 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
Traceback (most recent call last):
  File "train.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 111, in main
    use_tpu=FLAGS.use_tpu)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 569, in train_loop
    ckpt.restore(latest_checkpoint)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 2009, in restore
    status = self._saver.restore(save_path=save_path)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1304, in restore
    checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
    restore_ops = trackable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 907, in _restore_from_checkpoint_position
    tensor_saveables, python_saveables))
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 289, in restore_saveables
    validated_saveables).restore(self.save_path_tensor)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 281, in restore
    restore_ops.update(saver.restore(file_prefix))
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 103, in restore
    restored_tensors, restored_shapes=None)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in restore
    for v in self._mirrored_variable.values))
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in <genexpr>
    for v in self._mirrored_variable.values))
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 392, in _assign_on_device
    return variable.assign(tensor)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 846, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "/home/davide/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 1117, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (9,) and (810,) are incompatible
WARNING:tensorflow:Unresolved object in checkpoint: (root).save_counter
W0717 10:59:21.835075 140059259959104 util.py:144] Unresolved object in checkpoint: (root).save_counter
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0717 10:59:21.835391 140059259959104 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

Probably also worth pointing out that I tested the installation successfully with python object_detection/builders/model_builder_tf2_test.py

bobokvsky commented 4 years ago

I have the same issue as well.

I tried to load checkpoints from the models ssd_mobilenet_v1_fpn_640x640 and efficientdet_d0_coco17, and none of them load properly.

Shakesbeery commented 4 years ago

Issue confirmed for me as well for all pre-trained EfficientDet models in the zoo. Other model types not tested, yet.

W0717 08:31:07.516187  7684 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

Running model_main_tf2.py fails silently with no error and no traceback, just the above warning.

model_lib_tf2_test.py passes with 3 skips, no failures.

OS Platform and Distribution: Windows 10
TensorFlow installed from (source or binary): pip install tensorflow==2.2
TensorFlow version: tensorflow 2.2.0
Python version: 3.7
CUDA/cuDNN version: CUDA 10.1 / CuDNN 7.6.5
GPU model and memory: 1080 Ti (12Gb -- 8.5 available)

xieyh commented 4 years ago

Same issue with the following models:
centernet_resnet50_v1_fpn_512x512_coco17_tpu-8.tar
efficientdet_d0_coco17_tpu-32.tar
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar
faster_rcnn_resnet50_v1_640x640_coco17_tpu-8.tar
centernet_hg104_512x512_coco17_tpu-8.tar
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar

OS: Windows 10
TensorFlow version: tensorflow 2.2.0
Python version: 3.7
CUDA/cuDNN version: CUDA 10.1 / CuDNN 7.6.5
GPU: 1660 Ti

wronk commented 4 years ago

I seem to be running into this same issue with loading a config value for Mask RCNN on Mac with TF 2.2.0

I'm using the Mask RCNN model weights from the bottom of the TF2 Detection Model Zoo page and the example MASK RCNN config sample here. I have the same issue where I get warnings (~150) for layers failing to load. For example:

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor_for_box_classifier_features.layer_with_weights-14.kernel
W0717 14:29:02.269019 4319387072 util.py:144] Unresolved object in checkpoint: (root).model._feature_extractor_for_box_classifier_features.layer_with_weights-14.kernel

The fine-tune checkpoint specified in the example .config file uses a filename (inception_resnet_v2.ckpt-1) that is not in the zipped checkpoint from the model zoo. I'm not sure if that's the problem. Separately, I also noticed that under train_config I had to set fine_tune_checkpoint_version: V2 or it would fail to accept the configuration.

tnb-wu commented 4 years ago

I have the same kind of issue.

model and checkpoint

SSD MobileNet v2 320x320 from TensorFlow 2 Detection Model Zoo

config

I edited this pipeline.config for my local files. I changed num_classes from 90 to 13 in my config file for my own dataset.

command

python object_detection/model_main_tf2.py \
--pipeline_config_path=/my_model_dir/my_model.config \
--model_dir=/my_model_dir/ \
--alsologtostderr

I get the following error.

W0720 05:36:27.828208 139692008720192 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
Traceback (most recent call last):
  File "object_detection/model_main_tf2.py", line 106, in <module>
    tf.compat.v1.app.run()
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "object_detection/model_main_tf2.py", line 103, in main
    use_tpu=FLAGS.use_tpu)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/object_detection/model_lib_v2.py", line 569, in train_loop
    ckpt.restore(latest_checkpoint)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 2009, in restore
    status = self._saver.restore(save_path=save_path)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1304, in restore
    checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
    restore_ops = trackable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py", line 907, in _restore_from_checkpoint_position
    tensor_saveables, python_saveables))
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 289, in restore_saveables
    validated_saveables).restore(self.save_path_tensor)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py", line 281, in restore
    restore_ops.update(saver.restore(file_prefix))
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py", line 103, in restore
    restored_tensors, restored_shapes=None)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 647, in restore
    for v in self._mirrored_variable.values))
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 647, in <genexpr>
    for v in self._mirrored_variable.values))
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 392, in _assign_on_device
    return variable.assign(tensor)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 846, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 1117, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (42,) and (273,) are incompatible

If I set num_classes = 90 (the default value) in the config file, the training process starts running.
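
(Just my own reading of those numbers, not something confirmed above: the two shapes look like the class-head depth num_anchors_per_location * (num_classes + 1) for the feature map with 3 anchors per location, i.e. the restore is trying to fit the COCO-sized head into the resized one.)

# Hypothetical arithmetic check of the mismatched shapes, assuming 3 anchors
# per location plus one background class.
for num_classes in (13, 90):
    print(3 * (num_classes + 1))  # prints 42 and 273 -> matches "(42,) and (273,)"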

environment

OS: Ubuntu 18.04
Python: 3.6
TensorFlow: 2.2.0
CUDA/cuDNN: 10.0/7.6.5
GPU: RTX 2080Ti

MaesIT commented 4 years ago

I seem to be having the same issue as the original poster.
I successfully trained efficientdet_d0 (from scratch, for roughly 5,000 steps), then tried to train efficientdet_d1 with "fine_tune_checkpoint" pointing to the final efficientdet_d0 checkpoint, but then I also get the warnings and the training does not start:

WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.0.bias
W0720 10:31:50.390428 140235167352640 util.py:144] Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.0.bias
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.gamma
W0720 10:31:50.390517 140235167352640 util.py:144] Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.beta
W0720 10:31:50.390611 140235167352640 util.py:144] Unresolved object in checkpoint: (root).optimizer's state 'momentum' for (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.beta
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0720 10:31:50.390712 140235167352640 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

I'm running this training on a single-GPU machine, so I've commented out the "sync_replicas" and "replicas_to_aggregate" parameters, and I've tuned the hyperparameters a bit (learning rate & batch size) to make the model produce some output.

marvision-ai commented 4 years ago

Same issue:

  File "/home/musashi/.virtualenvs/tf2.0/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 1117, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (36,) and (810,) are incompatible
WARNING:tensorflow:Unresolved object in checkpoint: (root).save_counter
W0721 12:58:29.073055 139770281662272 util.py:144] Unresolved object in checkpoint: (root).save_counter
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0721 12:58:29.073232 139770281662272 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

BernardinD commented 4 years ago

I was able to get past this error by changing the fine_tune_checkpoint_type to "detection"
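
For anyone applying that change by script rather than editing the file by hand, a rough sketch (placeholder path; assumes the object_detection package and its protos are installed, and is not an official snippet from the API):

from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

config_path = "model/pipeline.config"  # placeholder path

pipeline = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path, "r") as f:
    text_format.Merge(f.read(), pipeline)

# The change discussed above: restore the zoo checkpoint as a detection model.
pipeline.train_config.fine_tune_checkpoint_type = "detection"

with open(config_path, "w") as f:
    f.write(text_format.MessageToString(pipeline))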

UPDATE: I'm running this training on Colab and I keep getting memory allocation issues. I had to resort to running with a batch_size of 1.... Any suggestions?

siyangbing commented 4 years ago

I have the same problem!

nrasadi commented 4 years ago

@BernardinD

I was able to get past this error by changing the fine_tune_checkpoint_type to "detection"

It also worked for me.

UPDATE: I'm running this training on Colab and I keep getting memory allocation issues. I had to resort to running with a batch_size of 1.... Any suggestions?

Why don't you use TPU on Colab?

hasansalimkanmaz commented 4 years ago

Not a real answer, but if you want to train the model anyway, this works.

I have encountered the same situation. Then I commented out fine_tune_checkpoint_version: V2, fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/ckpt-1", and fine_tune_checkpoint_type: "detection" (I think this means the pre-trained model is not used). After this change, I managed to start the training. I will see the results, but I am not expecting much :( .

BouleJaune commented 4 years ago

From this notebook: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/eager_few_shot_od_training_tf2_colab.ipynb

We see that they restore the checkpoint with .expect_partial()

However, in model_lib_v2.py they load the checkpoint without it (line 569).

If I'm not wrong, in the first version some variables weren't loaded every time, and that was apparently normal.
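
A minimal sketch of the difference being discussed, using a toy model and a temporary path rather than the actual training-loop objects:

import tensorflow as tf

# Stand-in model; in the API the real model is built by model_builder.
model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
save_path = tf.train.Checkpoint(model=model).save("/tmp/demo_ckpt")

# Colab-notebook style: expect_partial() silences the "not all checkpointed
# values were used" warnings when a partial restore is intentional.
tf.train.Checkpoint(model=model).restore(save_path).expect_partial()

# model_lib_v2.py style (around line 569): plain restore(), so anything in the
# checkpoint that never gets matched is reported as the warnings shown above.
status = tf.train.Checkpoint(model=model).restore(save_path)
# status.assert_consumed()  # would raise instead of warn on unmatched values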

Shakesbeery commented 4 years ago

I was able to get past this error by changing the fine_tune_checkpoint_type to "detection"

This works for some models, but others like CenterNet1024 still fail in the same manner.

wronk commented 4 years ago

Setting/changing the fine_tune_checkpoint_type to detection for MaskRCNN also doesn't seem to work.

Renart-fox commented 4 years ago

I can confirm I am able to reproduce this with the EfficientDet D7 model given in the TF2 model zoo. Changing fine_tune_checkpoint_type to detection does not solve the issue.

TensorFlow version: latest stable
TFOD installation branch: master

zishanahmed08 commented 4 years ago

@tombstone Hi Vivek, could you please help us out here? Also, please confirm whether setting fine_tune_checkpoint_type to detection means the pretrained model is not used.

lakshay1296 commented 4 years ago

Hello, I'm facing the same issue using "faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8", "centernet_hg104_1024x1024_coco17_tpu-32", and "efficientdet_d2_coco17_tpu-32". They all have the same structure, as shown below. Also, changing fine_tune_checkpoint_type to detection does not work either.

├── checkpoint
│   ├── checkpoint
│   ├── ckpt-0.data-00000-of-00001
│   └── ckpt-0.index
├── pipeline.config
└── saved_model
    ├── saved_model.pb
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index

I am using a Google Cloud Compute Engine instance for this task.
CPU: N1 type, 8 cores
RAM: 40GB
GPU: N/A
OS: Ubuntu 18.04
Python: 3.6
TensorFlow: 2.2

Geoyi commented 4 years ago

I got the same error when trying to use both fast rcnn and context rcnn. Has anyone solved the issue? Confirming that changing fine_tune_checkpoint_type to detection doesn't help in these cases either.

nicholasguimaraes commented 4 years ago

Good evening everyone, I'm trying to fine-tune efficientdet_d4_coco17_tpu-32 and I'm also facing the same issue mentioned above:

raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (224,) and (256,) are incompatible

Tried fine-tuning EfficientDet D2 and got the same error: Shapes (112,) and (256,) are incompatible

I realized that the depth of the box_predictor is 224 on effnet d4 and 112 on effnet b2.

Still working on a solution.

Could this be related to the .tfrecord file?

nicholasguimaraes commented 4 years ago

Hello again, I got around this error (ValueError: Shapes (112,) and (256,) are incompatible) by setting pad_to_max_dimension to false in my config file.

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 768
    max_dimension: 768
    pad_to_max_dimension: false
  }
}

aabbas90 commented 4 years ago

Having the same issue on finetuning CenterNet on COCO17. Also, training from scratch is working fine.

Deepthi-Jain commented 4 years ago

@aabbas90 Can you please provide the steps to train from scratch? I'm also having the same issue. Thanks.

Deepthi-Jain commented 4 years ago

Having the same issue on finetuning CenterNet on COCO17. Also, training from scratch is working fine.

Can you please provide the steps to train from scratch? I'm also having the same issue. Thanks.

Deepthi-Jain commented 4 years ago

Not a real answer, but if you want to train the model anyway, this works.

I have encountered the same situation. Then I commented out fine_tune_checkpoint_version: V2, fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/ckpt-1", and fine_tune_checkpoint_type: "detection" (I think this means the pre-trained model is not used). After this change, I managed to start the training. I will see the results, but I am not expecting much :( .

Can you please let me know what happened to your training?

hasansalimkanmaz commented 4 years ago

Not a real answer, but if you want to train the model anyway, this works. I have encountered the same situation. Then I commented out fine_tune_checkpoint_version: V2, fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/ckpt-1", and fine_tune_checkpoint_type: "detection" (I think this means the pre-trained model is not used). After this change, I managed to start the training. I will see the results, but I am not expecting much :( .

Can you please let me know what happened to your training?

I gave up on the TF object detection API. It took a lot of my time. I switched to detectron2.

aabbas90 commented 4 years ago

Not a real answer, but if you want to train the model anyway, this works. I have encountered the same situation. Then I commented out fine_tune_checkpoint_version: V2, fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/ckpt-1", and fine_tune_checkpoint_type: "detection" (I think this means the pre-trained model is not used). After this change, I managed to start the training. I will see the results, but I am not expecting much :( .

Can you please let me know what happened to your training?

I set the fine_tune_checkpoint field to empty, i.e. fine_tune_checkpoint: " ". If the rest of the config is fine, it should train from scratch for you.

psychonetic commented 4 years ago

Maybe it is not intended, but when using TensorFlow 2.3 I am actually able to train a model. I guess this may result in other problems, though. My model is still training; I may edit this post once it's done.

aabbas90 commented 4 years ago

Maybe it is not intended, but when using TensorFlow 2.3 I am actually able to train a model. I guess this may result in other problems, though. My model is still training; I may edit this post once it's done.

Are you also able to fine-tune from the provided checkpoints in the API?

veonua commented 4 years ago

Updating to TF 2.3 doesn't solve the problem.

It looks like changing the fine-tuning mode to "detection" allows training to run, but I can't say whether it preserves the classes of the original model or not.

Overall, the fine-tuning is somehow done wrong (special format, various incompatible versions, GPU/TPU), so I changed my pipeline to load checkpoints and do fine-tuning in my own experiments.

aabbas90 commented 4 years ago

Updating to TF 2.3 doesn't solve the problem. It looks like changing the fine-tuning mode to "detection" allows training to run, but I can't say whether it preserves the classes of the original model or not. Overall, the fine-tuning is somehow done wrong (special format, various incompatible versions, GPU/TPU), so I changed my pipeline to load checkpoints and do fine-tuning in my own experiments.

+1.

I tried multi-GPU training from scratch as I mentioned here: https://github.com/tensorflow/models/issues/5565#issuecomment-669123077. The checkpoint created this way does not allow fine-tuning on a single GPU.

jrash33 commented 4 years ago

I'm getting this same error as well! Support needed!

wardeha commented 4 years ago

Getting the same warnings when I try to use the initial checkpoint. I am using TF 2.3 and trying to train on a novel dataset with 2 classes using ssd_mobilenet_v2_320x320.

mosch91-syn commented 4 years ago

Can confirm that setting fine_tune_checkpoint_type to "detection" worked for me as well. Successfully trained an SSD MobileNet V2 FPN and an EfficientDet_D0 on a custom dataset. This was also stated in the tutorial https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/index.html. The stable version is 2.2, so maybe those of you trying with 2.3 might give it a try after downgrading to 2.2.

Renart-fox commented 4 years ago

A solution that doesn't work for everyone is not a solution

aabbas90 commented 4 years ago

Can confirm that setting fine_tune_checkpoint_type to "detection" worked for me as well. Successfully trained an SSD MobileNet V2 FPN and an EfficientDet_D0 on a custom dataset. This was also stated in the tutorial https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/index.html. The stable version is 2.2, so maybe those of you trying with 2.3 might give it a try after downgrading to 2.2.

Can you please point to the commit of this repo you are using? I had already tried version 2.2 and hit the same issue. Moreover, what kind of training hardware are you using: single GPU, multi-GPU, or TPU? Thanks!

aabbas90 commented 4 years ago

Please note that the claim mentioned here: https://github.com/tensorflow/models/issues/8967#issuecomment-665082686 is not true as I was not able to fine-tune even by setting fine_tune_checkpoint_type: "fine_tune".

vighneshbirodkar commented 4 years ago

I can comment on the CenterNet issues:

  1. "detection" checkpoints are currently supported by hourglass models. The error messages are pointing towards the fact that the checkpoints weights are not in the right format. We currently only support one "detection" type checkpoint with the hourglass model. It is the ExtrementNet checkpoint from the TF2 Zoo. Once you download an unzip the file, the path should point to /path/to/file/extremenet/ckpt-1
  2. Once #9089 is in, "fine_tune" checkpoints will be supported for all CenterNet models. For "fine_tune", the path should point to a CenterNet* checkpoint of the same type in the TF2 Zoo. For example /path/to/file/centernet_hg104_512x512_coco17_tpu-8/checkpoint/ckpt-0
  3. "classification" is supported for ResNet based feature extractors and you can use this script to create them.

missmantou commented 4 years ago

Just to report, changing the fine_tune_checkpoint_type to "detection" works for me for both faster rcnn resnet and ssd mobilenet.

vighneshbirodkar commented 4 years ago

@aabbas90's comment is fixed with https://github.com/tensorflow/models/commit/fd6987fafb615427316c0bfac6fdb185273fcfcc

vighneshbirodkar commented 4 years ago

For further clarifications, it would be helpful if we knew the exact contents of the config being used along with a link to which checkpoint you are using.

aabbas90 commented 4 years ago

  1. I am able to fine-tune centernet_resnet50_v1_fpn_512x512_coco17_tpu-8 with the associated pipeline.config and pre-trained model from the TF2 model zoo, where I use fine_tune_checkpoint_type: "fine_tune".
  2. Carrying out the same procedure on efficientdet_d0_coco17_tpu-32 from the TF2 model zoo, fine-tuning does not work but detection does. @vighneshbirodkar: I see a conflicting definition between the detection and fine_tune checkpoint types when comparing https://github.com/tensorflow/models/blob/2bef12e6fd830df331a858a3ca29a18357551e16/research/object_detection/meta_architectures/ssd_meta_arch.py#L1319-L1329 with https://github.com/tensorflow/models/blob/2bef12e6fd830df331a858a3ca29a18357551e16/research/object_detection/meta_architectures/center_net_meta_arch.py#L3006-L3024: SSD is loading everything in detection mode, whereas CenterNet is loading only the feature extractor.
  3. @vighneshbirodkar could you also please clarify which config files a user should start from: (a) the ones inside the repo, i.e. in https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2, or (b) the config files shipped with the models downloaded from the TF2 model zoo? Note that these config files have different parameters, at least for efficientdet_d0.
vighneshbirodkar commented 4 years ago

  1. Great, that is the intended usage. You should also be able to use detection with the ExtremeNet checkpoint.

  2. SSD is not loading the whole model; the lines

      fake_model = tf.train.Checkpoint(
          _feature_extractor=self._feature_extractor)
      return {'model': fake_model}

    ensure that only the feature extractor is loaded.

  3. I would start with the ones in configs/tf2
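
The fake_model trick quoted in point 2 can also be sketched outside the API with plain tf.train.Checkpoint objects; toy modules and a temporary path below, purely to illustrate the matching behaviour:

import tensorflow as tf

class TinyDetector(tf.Module):
    # Toy object graph standing in for model._feature_extractor / model._box_head.
    def __init__(self, fill):
        super().__init__()
        self._feature_extractor = tf.Module()
        self._feature_extractor.kernel = tf.Variable(tf.fill([3, 3], fill))
        self._box_head = tf.Module()
        self._box_head.kernel = tf.Variable(tf.fill([3, 5], fill))

full = TinyDetector(1.0)
path = tf.train.Checkpoint(model=full).save("/tmp/full_detector")

# Restore ONLY the feature extractor into a fresh model: the fake root exposes
# just _feature_extractor, so nothing else in the checkpoint gets matched.
fresh = TinyDetector(0.0)
fake_model = tf.train.Checkpoint(_feature_extractor=fresh._feature_extractor)
tf.train.Checkpoint(model=fake_model).restore(path).expect_partial()

print(fresh._feature_extractor.kernel.numpy().mean())  # 1.0 -> restored
print(fresh._box_head.kernel.numpy().mean())           # 0.0 -> left untouched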

fclof commented 4 years ago

I have successfully fine-tuned the 1024 versions of faster rcnn and efficientdet. However, when I am fine-tuning CenterNet HourGlass104 1024x1024, I hit the same issue and my program automatically gets killed. I am training on an RTX 2080 Ti.

Update: After reinstalling the latest version of the object detection API and changing the config file from "detection" to "fine_tune", the issue disappeared, and I am now training CenterNet as expected.

  1. Once #9089 is in, "fine_tune" checkpoints will be supported for all CenterNet models.

aabbas90 commented 4 years ago

@fclof Are you observing very slow convergence on CenterNet? I have tried fine-tuning both efficientdet_d0 and centerNet_hourglass; efficientDet converges really well, while CenterNet is not showing any signs of convergence. Thanks!

vighneshbirodkar commented 4 years ago

What batch size and learning rate are you using @aabbas90 ?

aabbas90 commented 4 years ago

What batch size and learning rate are you using @aabbas90 ?

@vighneshbirodkar: batch_size: 8, learning_rate: 5e-4. I have also tried centernet_resnet50_v1_fpn_512x512 and do not see convergence after one day of training on a simpler dataset than COCO2017. Specifically, the object_center loss almost always remains above 2.0, while for efficientDet_d0 the total loss dips very fast, already after one hour of training.

navganti commented 4 years ago

Hi everyone, I'm experiencing the same issue seen here.

System information

OS Platform and Distribution: Ubuntu 18.04
TensorFlow installed from (source or binary): installed using pip in a virtualenv
TensorFlow version: tensorflow 2.3.0
Python version: 3.6.10
CUDA/cuDNN version: CUDA 10.1, CuDNN 7.6.5
GPU model and memory: RTX 2070 Super, 8GB VRAM
Object Detection API: latest - models @ 40e124320636797487b4db476511bf7147616a93

Models Tested

For all of the above I'm seeing the incompatible shapes error when restoring the checkpoints. In all cases the fine_tune_checkpoint_type was set to detection, except for CenterNet HourGlass, which was set to fine_tune. When running CenterNet HourGlass with detection, I get a different error: AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program.

In all cases, the failure occurs when trying to restore the checkpoints. I've downloaded my checkpoints from the TF2 Model Zoo, and I've modified the configs found in configs/tf2.

In all cases I've set my batch sizes to be 2 just to try and get anything running. I have changed a few settings in my config files, such as:

Happy to provide more information - any help would be much appreciated!

aabbas90 commented 4 years ago

What batch size and learning rate are you using @aabbas90?

@vighneshbirodkar: batch_size: 8, learning_rate: 5e-4. I have also tried centernet_resnet50_v1_fpn_512x512 and do not see convergence after one day of training on a simpler dataset than COCO2017. Specifically, the object_center loss almost always remains above 2.0, while for efficientDet_d0 the total loss dips very fast, already after one hour of training.

UPDATE: Even though the loss is not decreasing as much as I expected, the evaluation metrics on the test data still look very promising, so I think this is no longer an issue.

vighneshbirodkar commented 4 years ago

@navganti Which checkpoint did you download for CenterNet, and what are your exact checkpoint type and checkpoint path?