tensorflow / models

Models and examples built with TensorFlow

Can't transfer learn object detection models checkpoint for dataset with different classes #8701

Closed guijaci closed 4 years ago

guijaci commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/model_main.py

2. Describe the bug

I'm trying to apply transfer learning to an SSD MobileNet model using the Object Detection API, for another dataset with 8 classes. I'm following the Running Locally and the Using your own Dataset tutorials. After creating the TFRecord files, running model_main.py yields:

Assign requires shapes of both tensors to match. lhs shape= [54] rhs shape= [546]
     [[node save/Assign_14 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]

These shapes come from the last layers, as I confirmed by inspecting the model. My take is that the script is trying to restore the detection layers, even though the checkpoint was trained with a different number of classes and I set the train_config option in pipeline.config: load_all_detection_checkpoint_vars: false
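One way to confirm which variables disagree is to compare the shapes the graph expects with the shapes stored in the checkpoint. The sketch below is only an illustration: the variable names and the feature-extractor shape are made up, while the class predictor shapes are taken from the error messages in this report. In a real run, the checkpoint side could presumably be filled from the (name, shape) pairs returned by `tf.train.list_variables`.

```python
from pathlib import Path  # not required below; kept minimal on purpose
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    graph_shapes: Dict[str, Shape],
    checkpoint_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, graph_shape, checkpoint_shape) for shared variables whose shapes differ."""
    mismatches = []
    # Only variables present on both sides can trigger an Assign shape error.
    for name in sorted(graph_shapes.keys() & checkpoint_shapes.keys()):
        graph_shape = tuple(graph_shapes[name])
        ckpt_shape = tuple(checkpoint_shapes[name])
        if graph_shape != ckpt_shape:
            mismatches.append((name, graph_shape, ckpt_shape))
    return mismatches


# Hypothetical variable names; the 54/546 shapes are the ones from the error log.
graph_shapes = {
    "feature_extractor/conv/weights": (3, 3, 32, 64),
    "class_predictor/weights": (3, 3, 256, 54),
    "class_predictor/biases": (54,),
}
checkpoint_shapes = {
    "feature_extractor/conv/weights": (3, 3, 32, 64),
    "class_predictor/weights": (3, 3, 256, 546),
    "class_predictor/biases": (546,),
}

for name, graph_shape, ckpt_shape in find_shape_mismatches(graph_shapes, checkpoint_shapes):
    print(f"{name}: graph {graph_shape} vs checkpoint {ckpt_shape}")
```

Only the class-dependent variables are reported, which matches the observation that the failing assignments sit in the last layers.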

What reinforces this hypothesis is that the shape mentioned in the error is related to the number of classes:

My dataset = 8 classes: 54 = (8+1)*3*2

COCO dataset = 90 classes: 546 = (90+1)*3*2

Sometimes the log changes, showing layers with more dimensions or other multiples of those numbers. But the problem is always on the class-dependent axis, with assignments like 54 << 546 and 27 << 273. When I change num_classes in pipeline.config, the numbers follow this pattern.
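The arithmetic behind those numbers can be checked with a few lines of Python. This is a sketch under two assumptions that are not stated in the report: that the `+1` is a background class, and that the remaining factor (3, or 3*2 = 6) is the number of anchors per location. The function name is hypothetical.

```python
def class_predictor_channels(num_classes: int, anchors_per_location: int) -> int:
    """Channels on the class axis: (classes + 1 assumed background) per anchor."""
    return (num_classes + 1) * anchors_per_location


for anchors in (3, 6):
    mine = class_predictor_channels(8, anchors)    # custom dataset, 8 classes
    coco = class_predictor_channels(90, anchors)   # COCO checkpoint, 90 classes
    print(f"{anchors} anchors per location: dataset={mine}, COCO={coco}")
```

With 3 and 6 anchors this reproduces the pairs seen in the logs: 27 vs 273 and 54 vs 546.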

3. Steps to reproduce

  1. Install Model Garden
  2. Follow the Using your own Dataset tutorial to create TFRecord files for a custom dataset.
  3. Download and extract a model checkpoint (e.g. ssd_mobilenet_v2_coco_2018_03_29) http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
  4. Configure pipeline.config from the samples for a different number of classes (e.g. 8)
  5. Follow the Running Locally tutorial.
  6. Wait until the error appears
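Step 3 can be scripted. The following is a minimal sketch using only the Python standard library; the helper names and the destination directory are hypothetical, and the URL is the one listed in step 3.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

CHECKPOINT_URL = (
    "http://download.tensorflow.org/models/object_detection/"
    "ssd_mobilenet_v2_coco_2018_03_29.tar.gz"
)


def download_archive(url: str, dest_dir: str) -> Path:
    """Download url into dest_dir and return the path of the saved archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path: str, dest_dir: str) -> List[str]:
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a source you trust.
        archive.extractall(dest_dir)
    return names


# Example usage (performs a network download, so it is left commented out):
# archive = download_archive(CHECKPOINT_URL, "downloads")
# print(extract_archive(str(archive), "downloads"))
```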

4. Expected behavior

I expected the script not to raise an exception because of the number of classes when the train_config option inside pipeline.config is: load_all_detection_checkpoint_vars: false

5. Additional context

5.1 Colab location

https://colab.research.google.com/drive/1GpxZD3ORxIuaOUQDBjO3V_4fP6elylHM?usp=sharing

5.2 Tested models:

ssd_mobilenet_v2_coco_2018_03_29
ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03
ssdlite_mobilenet_v2_coco_2018_05_09
ssd_mobilenet_v3_large_coco_2020_01_14
ssd_mobilenet_v3_small_coco_2020_01_14
ssd_inception_v2_coco_2018_01_28

5.3 Directory Structure:

+models
    +official
    +research
    ...
    +model
        +train
        +eval
        -checkpoint
        -model.ckpt.meta
        -model.ckpt.index
        -model.ckpt.data-00000-of-00001
    +data
        -label_map.pbtxt
        -train.record
        -val.record
    -pipeline.config
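Before launching the script, a quick sanity check can confirm that the files in the layout above are where the command expects them. This is a sketch that assumes the working directory is the top-level models directory; the function and constant names are hypothetical.

```python
from pathlib import Path
from typing import List

# Paths relative to the top-level "models" directory, as listed in the tree above.
EXPECTED_FILES = [
    "model/checkpoint",
    "model/model.ckpt.meta",
    "model/model.ckpt.index",
    "model/model.ckpt.data-00000-of-00001",
    "data/label_map.pbtxt",
    "data/train.record",
    "data/val.record",
    "pipeline.config",
]


def missing_files(root: str) -> List[str]:
    """Return the expected relative paths that do not exist as files under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


# Example usage from inside the models directory:
# print(missing_files("."))
```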

5.4 Call to model_main.py:

python research/object_detection/model_main.py \
  --model_dir=model \
  --checkpoint_dir=model \
  --pipeline_config_path=pipeline.config \
  --num_train_steps=2000 \
  --num_eval_steps=200 \
  --alsologtostderr
output ``` WARNING:tensorflow:Forced number of epochs for all eval validations to be 1. W0619 20:55:39.260014 140535217379200 model_lib.py:717] Forced number of epochs for all eval validations to be 1. INFO:tensorflow:Maybe overwriting train_steps: None I0619 20:55:39.260228 140535217379200 config_util.py:523] Maybe overwriting train_steps: None INFO:tensorflow:Maybe overwriting use_bfloat16: False I0619 20:55:39.260316 140535217379200 config_util.py:523] Maybe overwriting use_bfloat16: False INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1 I0619 20:55:39.260397 140535217379200 config_util.py:523] Maybe overwriting sample_1_of_n_eval_examples: 1 INFO:tensorflow:Maybe overwriting eval_num_epochs: 1 I0619 20:55:39.260483 140535217379200 config_util.py:523] Maybe overwriting eval_num_epochs: 1 INFO:tensorflow:Maybe overwriting load_pretrained: True I0619 20:55:39.260560 140535217379200 config_util.py:523] Maybe overwriting load_pretrained: True INFO:tensorflow:Ignoring config override key: load_pretrained I0619 20:55:39.260628 140535217379200 config_util.py:533] Ignoring config override key: load_pretrained WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1. W0619 20:55:39.261385 140535217379200 model_lib.py:733] Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1. 
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False I0619 20:55:39.261496 140535217379200 model_lib.py:768] create_estimator_and_inputs: use_tpu False, export_to_tpu False INFO:tensorflow:Using config: {'_model_dir': 'model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} I0619 20:55:39.261891 140535217379200 estimator.py:212] Using config: {'_model_dir': 'model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x7fd07f75ed08>) includes params argument, but params are not passed to Estimator. 
W0619 20:55:39.262096 140535217379200 model_fn.py:630] Estimator's model_fn (.model_fn at 0x7fd07f75ed08>) includes params argument, but params are not passed to Estimator. INFO:tensorflow:Waiting for new checkpoint at model I0619 20:55:39.262688 140535217379200 checkpoint_utils.py:124] Waiting for new checkpoint at model INFO:tensorflow:Found new checkpoint at model/model.ckpt I0619 20:55:39.265674 140535217379200 checkpoint_utils.py:133] Found new checkpoint at model/model.ckpt INFO:tensorflow:Starting Evaluation. I0619 20:55:39.265813 140535217379200 model_lib.py:924] Starting Evaluation. WARNING:tensorflow:From /content/models/research/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`. W0619 20:55:39.301869 140535217379200 deprecation.py:323] From /content/models/research/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`. WARNING:tensorflow:From /content/models/research/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use `tf.data.Dataset.map() W0619 20:55:39.321203 140535217379200 deprecation.py:323] From /content/models/research/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.map() WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num' W0619 20:55:39.349258 140535217379200 ag_logging.py:146] Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num' WARNING:tensorflow:Entity .transform_and_pad_input_data_fn at 0x7fd0a4bf9488> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Bad argument number for Name: 3, expecting 4 W0619 20:55:39.530438 140535217379200 ag_logging.py:146] Entity .transform_and_pad_input_data_fn at 0x7fd0a4bf9488> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Bad argument number for Name: 3, expecting 4 WARNING:tensorflow:From /content/models/research/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. 
W0619 20:55:39.535811 140535217379200 deprecation.py:323] From /content/models/research/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. WARNING:tensorflow:From /content/models/research/object_detection/utils/ops.py:493: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0619 20:55:39.543541 140535217379200 deprecation.py:323] From /content/models/research/object_detection/utils/ops.py:493: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /content/models/research/object_detection/inputs.py:260: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead. W0619 20:55:39.602569 140535217379200 deprecation.py:323] From /content/models/research/object_detection/inputs.py:260: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead. INFO:tensorflow:Calling model_fn. I0619 20:55:40.018049 140535217379200 estimator.py:1148] Calling model_fn. WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tf_slim/layers/layers.py:1089: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead. 
W0619 20:55:40.041596 140535217379200 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tf_slim/layers/layers.py:1089: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead. INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.220022 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.248053 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.275980 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.303051 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.330492 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 INFO:tensorflow:depth of additional conv before box predictor: 0 I0619 20:55:42.358266 140535217379200 convolutional_box_predictor.py:156] depth of additional conv before box predictor: 0 WARNING:tensorflow:From /content/models/research/object_detection/eval_util.py:830: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead. W0619 20:55:43.228709 140535217379200 deprecation.py:323] From /content/models/research/object_detection/eval_util.py:830: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead. 
WARNING:tensorflow:From /content/models/research/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2. - tf.py_function takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape. - tf.numpy_function maintains the semantics of the deprecated tf.py_func (it is not differentiable, and manipulates numpy arrays). It drops the stateful argument making all functions stateful. W0619 20:55:43.418199 140535217379200 deprecation.py:323] From /content/models/research/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2. - tf.py_function takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape. - tf.numpy_function maintains the semantics of the deprecated tf.py_func (it is not differentiable, and manipulates numpy arrays). It drops the stateful argument making all functions stateful. INFO:tensorflow:Done calling model_fn. I0619 20:55:44.068082 140535217379200 estimator.py:1150] Done calling model_fn. 
INFO:tensorflow:Starting evaluation at 2020-06-19T20:55:44Z I0619 20:55:44.083686 140535217379200 evaluation.py:255] Starting evaluation at 2020-06-19T20:55:44Z INFO:tensorflow:Graph was finalized. I0619 20:55:44.511930 140535217379200 monitored_session.py:240] Graph was finalized. 2020-06-19 20:55:44.517003: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz 2020-06-19 20:55:44.517266: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a2dd40 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2020-06-19 20:55:44.517301: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-06-19 20:55:44.519295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2020-06-19 20:55:44.610785: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.611862: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a2d800 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices: 2020-06-19 20:55:44.611893: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0 2020-06-19 20:55:44.612135: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.612709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285 pciBusID: 0000:00:04.0 2020-06-19 20:55:44.613019: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-06-19 20:55:44.614887: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10 2020-06-19 20:55:44.616497: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10 2020-06-19 20:55:44.616876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10 2020-06-19 20:55:44.618470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10 2020-06-19 20:55:44.619146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10 2020-06-19 20:55:44.621979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-19 20:55:44.622100: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.622696: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at 
least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.623234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0 2020-06-19 20:55:44.623290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 2020-06-19 20:55:44.624526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-19 20:55:44.624551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 2020-06-19 20:55:44.624561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N 2020-06-19 20:55:44.624689: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.625250: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-06-19 20:55:44.625787: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0. 
2020-06-19 20:55:44.625824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14974 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0) INFO:tensorflow:Restoring parameters from model/model.ckpt I0619 20:55:44.628076 140535217379200 saver.py:1284] Restoring parameters from model/model.ckpt Traceback (most recent call last): File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[{{node save/Assign_15}}]] (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[{{node save/Assign_15}}]] [[save/RestoreV2/_550]] 0 successful operations. 0 derived errors ignored. 
During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1290, in restore {self.saver_def.filename_tensor_name: save_path}) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[node save/Assign_15 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]] (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[node save/Assign_15 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]] [[save/RestoreV2/_550]] 0 successful operations. 0 derived errors ignored. 
Original stack trace for 'save/Assign_15': File "research/object_detection/model_main.py", line 114, in tf.app.run() File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run _run_main(main, args) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "research/object_detection/model_main.py", line 99, in main train_steps, name, FLAGS.max_eval_retries) File "/content/models/research/object_detection/model_lib.py", line 931, in continuous_eval max_retries=max_retries) File "/content/models/research/object_detection/model_lib.py", line 887, in _evaluate_checkpoint name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval return _evaluate() File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate self._evaluate_build_graph(input_fn, hooks, checkpoint_path)) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _evaluate_build_graph self._call_model_fn_eval(input_fn, self.config)) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1547, in _call_model_fn_eval features, labels, ModeKeys.EVAL, config) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/content/models/research/object_detection/model_lib.py", line 606, in model_fn save_relative_paths=True) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 828, in __init__ self.build() File 
"/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 840, in build self._build(self._filename, build_save=True, build_restore=True) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 878, in _build build_restore=build_restore) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 502, in _build_internal restore_sequentially, reshape) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps name="restore_shard")) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 350, in _AddRestoreOps assign_ops.append(saveable.restore(saveable_tensors, shapes)) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saving/saveable_object_util.py", line 73, in restore self.op.get_shape().is_fully_defined()) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/state_ops.py", line 227, in assign validate_shape=validate_shape) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_state_ops.py", line 66, in assign use_locking=use_locking, name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__ self._traceback = tf_stack.extract_stack() During handling of the above exception, another exception occurred: Traceback (most recent call last): File "research/object_detection/model_main.py", line 114, in tf.app.run() File 
"/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run _run_main(main, args) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "research/object_detection/model_main.py", line 99, in main train_steps, name, FLAGS.max_eval_retries) File "/content/models/research/object_detection/model_lib.py", line 931, in continuous_eval max_retries=max_retries) File "/content/models/research/object_detection/model_lib.py", line 893, in _evaluate_checkpoint raise e File "/content/models/research/object_detection/model_lib.py", line 887, in _evaluate_checkpoint name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval return _evaluate() File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 511, in _evaluate output_dir=self.eval_dir(name)) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1619, in _evaluate_run config=self._session_config) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/evaluation.py", line 269, in _evaluate_once session_creator=session_creator, hooks=hooks) as session: File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__ stop_grace_period_secs=stop_grace_period_secs) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 725, in __init__ self._sess = _RecoverableSession(self._coordinated_creator) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__ _WrappedSession.__init__(self, 
self._create_session()) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session return self._sess_creator.create_session() File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 878, in create_session self.tf_sess = self._session_creator.create_session() File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 647, in create_session init_fn=self._scaffold.init_fn) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session config=config) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/session_manager.py", line 204, in _restore_checkpoint saver.restore(sess, checkpoint_filename_with_path) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 1326, in restore err, "a mismatch between the current graph and the graph") tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error: 2 root error(s) found. (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[node save/Assign_15 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]] (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,256,54] rhs shape= [3,3,256,546] [[node save/Assign_15 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]] [[save/RestoreV2/_550]] 0 successful operations. 0 derived errors ignored. 
Original stack trace for 'save/Assign_15': File "research/object_detection/model_main.py", line 114, in tf.app.run() File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run _run_main(main, args) File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "research/object_detection/model_main.py", line 99, in main train_steps, name, FLAGS.max_eval_retries) File "/content/models/research/object_detection/model_lib.py", line 931, in continuous_eval max_retries=max_retries) File "/content/models/research/object_detection/model_lib.py", line 887, in _evaluate_checkpoint name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval return _evaluate() File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate self._evaluate_build_graph(input_fn, hooks, checkpoint_path)) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _evaluate_build_graph self._call_model_fn_eval(input_fn, self.config)) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1547, in _call_model_fn_eval features, labels, ModeKeys.EVAL, config) File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/content/models/research/object_detection/model_lib.py", line 606, in model_fn save_relative_paths=True) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 828, in __init__ self.build() File 
"/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 840, in build self._build(self._filename, build_save=True, build_restore=True) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 878, in _build build_restore=build_restore) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 502, in _build_internal restore_sequentially, reshape) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps name="restore_shard")) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py", line 350, in _AddRestoreOps assign_ops.append(saveable.restore(saveable_tensors, shapes)) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saving/saveable_object_util.py", line 73, in restore self.op.get_shape().is_fully_defined()) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/state_ops.py", line 227, in assign validate_shape=validate_shape) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_state_ops.py", line 66, in assign use_locking=use_locking, name=name) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__ self._traceback = tf_stack.extract_stack() ```

5.5 Example pipeline.config:

For ssd_mobilenet_v2_coco_2018_03_29

pipeline.config
```
model {
  ssd {
    num_classes: 8
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.99999989895e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.0299999993294
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.999700009823
          center: true
          scale: true
          epsilon: 0.0010000000475
          train: true
        }
      }
      use_depthwise: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.99999989895e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.0299999993294
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.999700009823
            center: true
            scale: true
            epsilon: 0.0010000000475
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011921
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298
        max_scale: 0.949999988079
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.333299994469
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.990000009537
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config {
  batch_size: 24
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.00400000018999
          decay_steps: 800720
          decay_factor: 0.949999988079
        }
      }
      momentum_optimizer_value: 0.899999976158
      decay: 0.899999976158
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "model/model.ckpt"
  load_all_detection_checkpoint_vars: false
  from_detection_checkpoint: true
  num_steps: 200000
  fine_tune_checkpoint_type: "detection"
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "annotations/label_map.pbtxt"
}
eval_config: {
  num_examples: 1188
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  use_moving_averages: false
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/val.record"
  }
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
```
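
The last dimension in the shape error (54 versus 546) can be checked against `num_classes` with a few lines of arithmetic. This is only a sketch: `class_predictor_depth` is a hypothetical helper, not part of the Object Detection API, and it assumes six anchors per location, which is how the `3*2` factor in the report is read here.

```python
def class_predictor_depth(num_classes, anchors_per_location):
    """Output channels of an SSD class-prediction conv layer.

    Each anchor gets one score per class plus one for the background class.
    """
    return (num_classes + 1) * anchors_per_location


# Assumption: six anchors per location, matching the 3*2 factor in the report.
ANCHORS = 6

new_depth = class_predictor_depth(num_classes=8, anchors_per_location=ANCHORS)
coco_depth = class_predictor_depth(num_classes=90, anchors_per_location=ANCHORS)

# new_depth is what the current graph expects (lhs),
# coco_depth is what the COCO checkpoint stores (rhs).
print(new_depth, coco_depth)
```

With three anchors per location the same formula gives 27 and 273, the other pair of numbers seen in the logs.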

5.6 Before you ask...

6. System information

veonua commented 4 years ago

Try running from a classification checkpoint:

from_detection_checkpoint: false

guijaci commented 4 years ago

So, I solved the issue:

Now I can properly train the model. Thing is, I don't really know whether I'm actually doing transfer learning now. What happens if checkpoint_dir points to an empty folder (where I want to keep the new dataset's checkpoints) and fine_tune_checkpoint points to where the pre-trained variables are?
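
For context on that question, here is a rough sketch of how `model_main.py` appears to choose between training and evaluation. It is a simplified paraphrase under assumptions, not the script's verbatim code: `select_run_mode` and the returned strings are hypothetical names. The stack trace earlier in the thread passes through `continuous_eval`, which is consistent with the evaluation branch being taken when `checkpoint_dir` is set.

```python
def select_run_mode(checkpoint_dir, run_once=False):
    """Simplified paraphrase of model_main.py's dispatch (an assumption)."""
    if checkpoint_dir:
        # Evaluation-only branch: checkpoints found in checkpoint_dir are
        # restored in full, so every variable shape has to match the graph.
        return "evaluate_once" if run_once else "continuous_eval"
    # Training branch: fine_tune_checkpoint from pipeline.config is used
    # to initialize weights, and new checkpoints go to model_dir.
    return "train_and_evaluate"


print(select_run_mode(checkpoint_dir="model/"))
print(select_run_mode(checkpoint_dir=None))
```

Read this way, transfer learning would be driven by `fine_tune_checkpoint` in `pipeline.config`, while `checkpoint_dir` only matters for evaluation runs.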