shamik111691 commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ Yes] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[ Yes] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[ Yes] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/legacy/train.py...

2. Describe the bug

Crash while training. Get the following error ValueError: Dimension 1 in both shapes must be equal, but are 9 and 17. Shapes are [64,9,9] and [64,17,17]. for '{{node model_1/mixed_7a/concat}} = ConcatV2[N=4, T=DT_FLOAT, Tidx=DT_INT32](model_1/activation_157/Relu, model_1/activation_159/Relu, model_1/activation_162/Relu, model_1/max_pooling2d_3/MaxPool, model_1/mixed_7a/concat/axis)' with input shapes: [64,17,17,384], [64,9,9,288], [64,9,9,320], [64,17,17,1088], [] and with computed input tensors: input[4] = <3>. 1
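
For readers unfamiliar with this error: concatenation along one axis requires every other dimension to match. The following is a minimal pure-Python sketch of that shape rule, not TensorFlow's implementation; the helper name concat_output_shape is hypothetical, and the shapes are the ones reported in the error message.

```python
def concat_output_shape(shapes, axis):
    """Return the shape produced by concatenating tensors along `axis`.

    Raises ValueError if any dimension other than `axis` differs between
    inputs, which is the kind of check that rejects the graph in this issue.
    """
    result = list(shapes[0])
    total = 0
    for shape in shapes:
        if len(shape) != len(result):
            raise ValueError(f"Rank mismatch: {shape} vs {result}")
        for dim, (expected, actual) in enumerate(zip(result, shape)):
            if dim != axis and expected != actual:
                raise ValueError(
                    f"Dimension {dim} must be equal, but are {expected} and {actual}"
                )
        total += shape[axis]
    result[axis] = total
    return result


# Input shapes reported for model_1/mixed_7a/concat (concat axis = 3).
issue_shapes = [
    [64, 17, 17, 384],
    [64, 9, 9, 288],
    [64, 9, 9, 320],
    [64, 17, 17, 1088],
]

try:
    concat_output_shape(issue_shapes, axis=3)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)
```

Two of the four branches arrive with a 9x9 spatial size and two with 17x17, so no concatenation along the channel axis is possible.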

3. Steps to reproduce

python legacy/train.py --logtostderr --train_dir=shamik/model_1/train --pipeline_config_path=shamik/model_1/train/faster_rcnn_inception_resnet_v2_atrous_coco.config

My Config file

# Faster R-CNN with Inception Resnet v2, Atrous version;
# Configured for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2_keras'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
  shuffle: false
  num_readers: 1
  num_epochs: 1
}

I have specified the correct paths
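
The note in the config about num_steps bypassing the learning rate schedule can be checked with a small sketch. It assumes manual_step_learning_rate behaves as a piecewise-constant lookup with no warmup (an assumption, not taken from the library source); the function name learning_rate_at is hypothetical, and the boundaries are copied from the config.

```python
import bisect

# (step, learning_rate) boundaries copied from the manual_step_learning_rate block.
SCHEDULE = [(0, 0.0003), (900000, 0.00003), (1200000, 0.000003)]


def learning_rate_at(step, schedule=SCHEDULE):
    """Piecewise-constant lookup: use the rate of the last boundary <= step."""
    boundaries = [boundary for boundary, _ in schedule]
    index = bisect.bisect_right(boundaries, step) - 1
    return schedule[max(index, 0)][1]


NUM_STEPS = 200000  # num_steps from the train_config

# Every distinct rate seen while training for NUM_STEPS steps.
rates_seen = {learning_rate_at(step) for step in (0, NUM_STEPS // 2, NUM_STEPS - 1)}
print(rates_seen)
```

Because the first decay boundary is at step 900000, a run capped at 200000 steps stays at the initial rate throughout.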

4. Expected behavior

Training should have carried on smoothly

5. Additional context

Include any logs that would be helpful to diagnose the problem.

Traceback (most recent call last):
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in _create_c_op
    c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 1 in both shapes must be equal, but are 9 and 17. Shapes are [64,9,9] and [64,17,17]. for '{{node model_1/mixed_7a/concat}} = ConcatV2[N=4, T=DT_FLOAT, Tidx=DT_INT32](model_1/activation_157/Relu, model_1/activation_159/Relu, model_1/activation_162/Relu, model_1/max_pooling2d_3/MaxPool, model_1/mixed_7a/concat/axis)' with input shapes: [64,17,17,384], [64,9,9,288], [64,9,9,320], [64,17,17,1088], [] and with computed input tensors: input[4] = <3>.

Traceback (most recent call last):
  File "/remote/platforms/common/CMLP/workspace/071620-1/python36_cmlp5r/lib/python3.6/site-packages/object_detection/legacy/train.py", line 186, in <module>
    tf.app.run()
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/remote/platforms/common/CMLP/workspace/071620-1/python36_cmlp5r/lib/python3.6/site-packages/object_detection/legacy/train.py", line 182, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/legacy/trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/deployment/model_deploy.py", line 192, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/legacy/trainer.py", line 203, in _create_losses
    prediction_dict = detection_model.predict(images, true_image_shapes)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 833, in predict
    true_image_shapes, **side_inputs))
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1006, in _predict_second_stage
    **side_inputs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1070, in _box_prediction
    flattened_proposal_feature_maps)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1159, in _extract_box_classifier_features
    flattened_feature_maps))
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 778, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 719, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 888, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 778, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/layers/merge.py", line 183, in call
    return self._merge_function(inputs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/layers/merge.py", line 522, in _merge_function
    return K.concatenate(inputs, axis=self.axis)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 2709, in concatenate
    return array_ops.concat([to_dense(x) for x in tensors], axis)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1606, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1189, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
    op_def=op_def)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1817, in __init__
    control_input_ops, op_def)
  File "/remote/platforms/common/CMLP/release/foundation/R202009PRD-1/python36_cmlp6r/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1657, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 1 in both shapes must be equal, but are 9 and 17. Shapes are [64,9,9] and [64,17,17]. for '{{node model_1/mixed_7a/concat}} = ConcatV2[N=4, T=DT_FLOAT, Tidx=DT_INT32](model_1/activation_157/Relu, model_1/activation_159/Relu, model_1/activation_162/Relu, model_1/max_pooling2d_3/MaxPool, model_1/mixed_7a/concat/axis)' with input shapes: [64,17,17,384], [64,9,9,288], [64,9,9,320], [64,17,17,1088], [] and with computed input tensors: input[4] = <3>. 1 return

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): TCSH
Mobile device name if the issue happens on a mobile device:
TensorFlow installed from (source or binary): Source
TensorFlow version (use command below): 2.2
Python version: 3.6
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:

jch1 commented 4 years ago

Hi - our keras models are only meant to be trained/eval'ed with model_main_tf2.py --- it seems that you are using a legacy binary?
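
For reference, a sketch of how the legacy invocation from the issue might map onto model_main_tf2.py. It assumes the script is run from within models/research and that --model_dir takes the place of the legacy --train_dir; the helper name build_train_command is hypothetical, and the paths are the ones from the original command.

```python
import shlex


def build_train_command(pipeline_config_path, model_dir):
    """Assemble an argument list for the TF2 training binary."""
    return [
        "python",
        "object_detection/model_main_tf2.py",
        f"--pipeline_config_path={pipeline_config_path}",
        f"--model_dir={model_dir}",
        "--alsologtostderr",
    ]


command = build_train_command(
    "shamik/model_1/train/faster_rcnn_inception_resnet_v2_atrous_coco.config",
    "shamik/model_1/train",
)

# Print a copy-pasteable shell line; subprocess.run(command, check=True)
# would launch it instead.
print(" ".join(shlex.quote(part) for part in command))
```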

shamik111691 commented 4 years ago

Hi @jch1

Thank you for your help. I have tried using model_main_tf2.py and it works. However, it does not print the loss at every step; it prints it only every 100 steps. Is it possible to print the loss at every step?

Also, I previously used the script export_inference_graph.py, followed by a modified version of object_detection_tutorial.ipynb.

What steps do I have to follow for inference now?

sglvladi commented 4 years ago

@shamik111691 The loss logging seems to be controlled by line 640 of model_lib_v2.py: https://github.com/tensorflow/models/blob/e9b70e67a57fe3ab8ad03b8f966069bd0845e64a/research/object_detection/model_lib_v2.py#L640

What you can do to log the loss on every step is change this line to:

if global_step.value() - logged_step >= 1:

and then reinstall the object_detection module (as described here), i.e. assuming you have already compiled the protos:

# from within models/research
cp object_detection/packages/tf2/setup.py .
python -m pip install .

Hope this helps.
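
To illustrate what that change does, here is a simplified stand-in for the logging condition using plain integers instead of the global_step variable. The function name steps_that_log is hypothetical, and this is a sketch of the condition only, not the code in model_lib_v2.py.

```python
def steps_that_log(num_steps, log_every):
    """Return the steps at which the loss would be logged."""
    logged_step = 0
    logged = []
    for global_step in range(1, num_steps + 1):
        # Same shape as the condition discussed above, with the threshold
        # pulled out into a parameter.
        if global_step - logged_step >= log_every:
            logged.append(global_step)
            logged_step = global_step
    return logged


every_100 = steps_that_log(300, log_every=100)
every_step = steps_that_log(5, log_every=1)
print(every_100)
print(every_step)
```

With a threshold of 100 only steps 100, 200 and 300 log; with a threshold of 1 every step does.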

PelinSuK commented 3 years ago


Hello, I have a question. I'm using TensorFlow 2.5 with Python 3.6. While running model_main_tf2.py with the command

python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2pets.config

I get the following error:

  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 106, in main
    model_lib_v2.train_loop(
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 524, in train_loop
    raise ValueError('train_pb2.load_all_detection_checkpoint_vars '
ValueError: train_pb2.load_all_detection_checkpoint_vars unsupported in TF2

I searched for four days for a fix and couldn't find one. Can you help me with this as well?
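
The message names the train_config field load_all_detection_checkpoint_vars, so one thing to try is removing that field from the pipeline config. The sketch below assumes the config sets the field on its own line; the config from this comment was not posted, so the sample text is hypothetical, as is the helper name drop_field.

```python
import re

FIELD = "load_all_detection_checkpoint_vars"


def drop_field(config_text, field=FIELD):
    """Remove lines of the form `<field>: <value>` from a text-format config."""
    pattern = re.compile(rf"^\s*{re.escape(field)}\s*:\s*\S+\s*$")
    kept = [line for line in config_text.splitlines() if not pattern.match(line)]
    return "\n".join(kept)


# Hypothetical excerpt standing in for the real pipeline config.
sample = """train_config: {
  batch_size: 1
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
}"""

cleaned = drop_field(sample)
print(cleaned)
```

Other fields in a config written for the legacy scripts may need changes as well; this only addresses the field named in the error.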

tensorflow / models

faster_rcnn_inception_resnet_v2_keras fails in TF2 #8963