Hi @JunHyungKang,
Sorry for the delay in responding. To expedite the troubleshooting process, please provide a code snippet or Colab notebook that reproduces the issue reported here. Thanks.
@laxmareddyp Here are my config and the detailed error.
{'runtime': {'all_reduce_alg': None,
'batchnorm_spatial_persistent': False,
'dataset_num_private_threads': None,
'default_shard_dim': -1,
'distribution_strategy': 'mirrored',
'enable_xla': False,
'gpu_thread_mode': None,
'loss_scale': None,
'mixed_precision_dtype': None,
'num_cores_per_replica': 1,
'num_gpus': 2,
'num_packs': 1,
'per_gpu_thread_count': 0,
'run_eagerly': False,
'task_index': -1,
'tpu': None,
'tpu_enable_xla_dynamic_padder': None,
'worker_hosts': None},
'task': {'allow_image_summary': False,
'annotation_file': None,
'differential_privacy_config': None,
'export_config': {'cast_detection_classes_to_float': False,
'cast_num_detections_to_float': False,
'output_intermediate_features': False,
'output_normalized_coordinates': False},
'freeze_backbone': False,
'init_checkpoint': None,
'init_checkpoint_modules': 'all',
'losses': {'box_loss_weight': 50,
'focal_loss_alpha': 0.25,
'focal_loss_gamma': 1.5,
'huber_loss_delta': 0.1,
'l2_weight_decay': 0.0001,
'loss_weight': 1.0},
'max_num_eval_detections': 100,
'model': {'anchor': {'anchor_size': 4.0,
'aspect_ratios': [0.5, 1.0, 2.0],
'num_scales': 3},
'backbone': {'resnet': {'bn_trainable': True,
'depth_multiplier': 1.0,
'model_id': 50,
'replace_stem_max_pool': False,
'resnetd_shortcut': False,
'scale_stem': True,
'se_ratio': 0.0,
'stem_type': 'v0',
'stochastic_depth_drop_rate': 0.0},
'type': 'resnet'},
'decoder': {'fpn': {'fusion_type': 'sum',
'num_filters': 256,
'use_keras_layer': False,
'use_separable_conv': False},
'type': 'fpn'},
'detection_generator': {'apply_nms': True,
'max_num_detections': 100,
'nms_iou_threshold': 0.5,
'nms_version': 'v2',
'pre_nms_score_threshold': 0.05,
'pre_nms_top_k': 5000,
'return_decoded': None,
'soft_nms_sigma': None,
'tflite_post_processing': {'max_classes_per_detection': 5,
'max_detections': 200,
'nms_iou_threshold': 0.5,
'nms_score_threshold': 0.1,
'normalize_anchor_coordinates': False,
'use_regular_nms': False},
'use_class_agnostic_nms': False,
'use_cpu_nms': False},
'head': {'attribute_heads': [],
'num_convs': 4,
'num_filters': 256,
'share_classification_heads': False,
'use_separable_conv': False},
'input_size': [1024, 1024, 3],
'max_level': 7,
'min_level': 3,
'norm_activation': {'activation': 'relu',
'norm_epsilon': 0.001,
'norm_momentum': 0.99,
'use_sync_bn': True},
'num_classes': 1},
'name': None,
'per_category_metrics': False,
'train_data': {'apply_tf_data_service_before_batching': False,
'block_length': 1,
'cache': False,
'cycle_length': None,
'decoder': {'simple_decoder': {'attribute_names': [],
'mask_binarize_threshold': None,
'regenerate_source_id': False},
'type': 'simple_decoder'},
'deterministic': None,
'drop_remainder': True,
'dtype': 'float32',
'enable_shared_tf_data_service_between_parallel_trainers': False,
'enable_tf_data_service': False,
'file_type': 'tfrecord',
'global_batch_size': 2,
'input_path': '/mnt/sda1/exc_cctv/1st/tfrecords/train/*',
'is_training': True,
'parser': {'aug_policy': None,
'aug_rand_hflip': False,
'aug_scale_max': 1.2,
'aug_scale_min': 0.8,
'aug_type': None,
'match_threshold': 0.5,
'max_num_instances': 100,
'num_channels': 3,
'skip_crowd_during_training': True,
'unmatched_threshold': 0.5},
'prefetch_buffer_size': None,
'seed': None,
'sharding': True,
'shuffle_buffer_size': 10000,
'tf_data_service_address': None,
'tf_data_service_job_name': None,
'tfds_as_supervised': False,
'tfds_data_dir': '',
'tfds_name': '',
'tfds_skip_decoding_feature': '',
'tfds_split': '',
'trainer_id': None,
'weights': None},
'use_coco_metrics': True,
'use_wod_metrics': False,
'validation_data': {'apply_tf_data_service_before_batching': False,
'block_length': 1,
'cache': False,
'cycle_length': None,
'decoder': {'simple_decoder': {'attribute_names': [],
'mask_binarize_threshold': None,
'regenerate_source_id': False},
'type': 'simple_decoder'},
'deterministic': None,
'drop_remainder': True,
'dtype': 'float32',
'enable_shared_tf_data_service_between_parallel_trainers': False,
'enable_tf_data_service': False,
'file_type': 'tfrecord',
'global_batch_size': 2,
'input_path': '/mnt/sda1/exc_cctv/1st/tfrecords/val/*',
'is_training': False,
'parser': {'aug_policy': None,
'aug_rand_hflip': False,
'aug_scale_max': 1.0,
'aug_scale_min': 1.0,
'aug_type': None,
'match_threshold': 0.5,
'max_num_instances': 100,
'num_channels': 3,
'skip_crowd_during_training': True,
'unmatched_threshold': 0.5},
'prefetch_buffer_size': None,
'seed': None,
'sharding': True,
'shuffle_buffer_size': 10000,
'tf_data_service_address': None,
'tf_data_service_job_name': None,
'tfds_as_supervised': False,
'tfds_data_dir': '',
'tfds_name': '',
'tfds_skip_decoding_feature': '',
'tfds_split': '',
'trainer_id': None,
'weights': None}},
'trainer': {'allow_tpu_summary': False,
'best_checkpoint_eval_metric': '',
'best_checkpoint_export_subdir': 'best',
'best_checkpoint_metric_comp': 'higher',
'checkpoint_interval': 9744,
'continuous_eval_timeout': 3600,
'eval_tf_function': True,
'eval_tf_while_loop': False,
'loss_upper_bound': 1000000.0,
'max_to_keep': 5,
'optimizer_config': {'ema': None,
'learning_rate': {'stepwise': {'boundaries': [555408,
652848],
'name': 'PiecewiseConstantDecay',
'offset': 0,
'values': [0.0025,
0.00025,
2.5e-05]},
'type': 'stepwise'},
'optimizer': {'sgd': {'clipnorm': None,
'clipvalue': None,
'decay': 0.0,
'global_clipnorm': None,
'momentum': 0.9,
'name': 'SGD',
'nesterov': False},
'type': 'sgd'},
'warmup': {'linear': {'name': 'linear',
'warmup_learning_rate': 0.0067,
'warmup_steps': 500},
'type': 'linear'}},
'preemption_on_demand_checkpoint': True,
'recovery_begin_steps': 0,
'recovery_max_trials': 0,
'steps_per_loop': 9744,
'summary_interval': 9744,
'train_steps': 701568,
'train_tf_function': True,
'train_tf_while_loop': True,
'validation_interval': 9744,
'validation_steps': -1,
'validation_summary_subdir': 'validation'}}
I0316 12:00:10.691335 139750881177728 controller.py:502] train | step: 253344 | steps/sec: 3.8 | output:
{'box_loss': 0.0017456077,
'cls_loss': 6.524426e-06,
'learning_rate': 0.0025,
'model_loss': 0.08728693,
'total_loss': 0.2591449,
'training_loss': 0.2591449}
I0316 12:00:11.730173 139750881177728 controller.py:531] saved checkpoint to /mnt/sda1/exc_cctv/results_retinanet/ckpt-253344.
I0316 12:00:11.730983 139750881177728 controller.py:297] eval | step: 253344 | running complete evaluation...
INFO:tensorflow:Error reported to Coordinator: Exception encountered when calling layer 'retina_net_model' (type RetinaNetModel).
in user code:
File "/home/vision/Models/models/official/vision/modeling/retinanet_model.py", line 169, in call *
final_results = self.detection_generator(raw_boxes, raw_scores,
File "/home/vision/Models/models/official/vision/modeling/layers/detection_generator.py", line 1512, in __call__ *
(nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections) = (
File "/home/vision/Models/models/official/vision/modeling/layers/detection_generator.py", line 588, in _generate_detections_v2 *
return _generate_detections_v2_class_aware(
File "/home/vision/Models/models/official/vision/modeling/layers/detection_generator.py", line 518, in _generate_detections_v2_class_aware *
nmsed_boxes = tf.concat(nmsed_boxes, axis=1)
ValueError: List argument 'values' to 'ConcatV2' Op with length 0 shorter than minimum length 2.
Call arguments received by layer 'retina_net_model' (type RetinaNetModel):
• images=tf.Tensor(shape=(1, 1024, 1024, 3), dtype=float32)
• image_shape=tf.Tensor(shape=(1, 2), dtype=float32)
• anchor_boxes={'3': 'tf.Tensor(shape=(1, 128, 128, 36), dtype=float32)', '4': 'tf.Tensor(shape=(1, 64, 64, 36), dtype=float32)', '5': 'tf.Tensor(shape=(1, 32, 32, 36), dtype=float32)', '6': 'tf.Tensor(shape=(1, 16, 16, 36), dtype=float32)', '7': 'tf.Tensor(shape=(1, 8, 8, 36), dtype=float32)'}
• output_intermediate_features=False
• training=False
Traceback (most recent call last):
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
yield
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 386, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/tmp/__autograph_generated_filebc32f0c3.py", line 17, in step_fn
logs = ag__.converted_call(ag__.ld(self).task.validation_step, (ag__.ld(inputs),), dict(model=ag__.ld(self).model, metrics=ag__.ld(self).validation_metrics), fscope_1)
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 439, in converted_call
result = converted_f(*effective_args, **kwargs)
File "/tmp/__autograph_generated_filefxbwta_h.py", line 12, in tf__validation_step
outputs = ag__.converted_call(ag__.ld(model), (ag__.ld(features),), dict(anchor_boxes=ag__.ld(labels)['anchor_boxes'], image_shape=ag__.ld(labels)['image_info'][:, 1, :], training=False), fscope)
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 331, in converted_call
return _call_unconverted(f, args, kwargs, options, False)
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 458, in _call_unconverted
return f(*args, **kwargs)
File "/home/vision/anaconda3/envs/tfm/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/tmp/__autograph_generated_fileiwd58oi2.py", line 228, in tf__call
ag__.if_stmt(ag__.ld(training), if_body_11, else_body_11, get_state_12, set_state_12, ('do_return', 'final_results', 'retval_', 'anchor_boxes'), 3)
File "/tmp/__autograph_generated_fileiwd58oi2.py", line 156, in else_body_11
final_results = ag__.converted_call(ag__.ld(self).detection_generator, (ag__.ld(raw_boxes), ag__.ld(raw_scores), ag__.ld(anchor_boxes), ag__.ld(image_shape), ag__.ld(raw_attributes)), None, fscope)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 213, in tf____call__
ag__.if_stmt(ag__.and_((lambda : ag__.ld(self)._config_dict['apply_nms']), (lambda : (ag__.ld(self)._config_dict['nms_version'] == 'tflite'))), if_body_8, else_body_8, get_state_8, set_state_8, ('do_return', 'retval_'), 2)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 200, in else_body_8
ag__.if_stmt(ag__.not_(ag__.ld(self)._config_dict['apply_nms']), if_body_7, else_body_7, get_state_7, set_state_7, ('do_return', 'retval_'), 2)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 185, in else_body_7
ag__.if_stmt((ag__.ld(self)._config_dict['nms_version'] == 'batched'), if_body_6, else_body_6, get_state_6, set_state_6, ('nmsed_attributes', 'nmsed_boxes', 'nmsed_classes', 'nmsed_scores', 'valid_detections'), 5)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 179, in else_body_6
ag__.if_stmt((ag__.ld(self)._config_dict['nms_version'] == 'v1'), if_body_5, else_body_5, get_state_5, set_state_5, ('nmsed_attributes', 'nmsed_boxes', 'nmsed_classes', 'nmsed_scores', 'valid_detections'), 5)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 173, in else_body_5
ag__.if_stmt((ag__.ld(self)._config_dict['nms_version'] == 'v2'), if_body_4, else_body_4, get_state_4, set_state_4, ('nmsed_attributes', 'nmsed_boxes', 'nmsed_classes', 'nmsed_scores', 'valid_detections'), 5)
File "/tmp/__autograph_generated_filepm9g43ot.py", line 141, in if_body_4
(nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections) = ag__.converted_call(ag__.ld(_generate_detections_v2), (ag__.ld(boxes), ag__.ld(scores)), dict(pre_nms_top_k=ag__.ld(self)._config_dict['pre_nms_top_k'], pre_nms_score_threshold=ag__.ld(self)._config_dict['pre_nms_score_threshold'], nms_iou_threshold=ag__.ld(self)._config_dict['nms_iou_threshold'], max_num_detections=ag__.ld(self)._config_dict['max_num_detections'], use_class_agnostic_nms=ag__.ld(self)._config_dict['use_class_agnostic_nms']), fscope)
File "/tmp/__autograph_generated_file0tsux9fz.py", line 36, in tf___generate_detections_v2
ag__.if_stmt(ag__.ld(use_class_agnostic_nms), if_body, else_body, get_state, set_state, ('do_return', 'retval_'), 2)
File "/tmp/__autograph_generated_file0tsux9fz.py", line 32, in else_body
retval_ = ag__.converted_call(ag__.ld(_generate_detections_v2_class_aware), (), dict(boxes=ag__.ld(boxes), scores=ag__.ld(scores), pre_nms_top_k=ag__.ld(pre_nms_top_k), pre_nms_score_threshold=ag__.ld(pre_nms_score_threshold), nms_iou_threshold=ag__.ld(nms_iou_threshold), max_num_detections=ag__.ld(max_num_detections)), fscope)
File "/tmp/__autograph_generated_file76u_22bw.py", line 60, in tf___generate_detections_v2_class_aware
nmsed_boxes = ag__.converted_call(ag__.ld(tf).concat, (ag__.ld(nmsed_boxes),), dict(axis=1), fscope)
ValueError: Exception encountered when calling layer 'retina_net_model' (type RetinaNetModel).
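For context, the failing call can be reproduced outside the Model Garden code: when the per-class NMS loop collects no results, tf.concat is handed an empty list, and the ConcatV2 op behind it requires at least two values. A minimal sketch of just that failure mode:

```python
import tensorflow as tf

# What the per-class NMS loop yields when no foreground class produces boxes.
nmsed_boxes = []

try:
    tf.concat(nmsed_boxes, axis=1)
except ValueError as e:
    # List argument 'values' to 'ConcatV2' Op with length 0 shorter than
    # minimum length 2.
    print(e)
```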
@laxmareddyp
The same issue occurs in a Colab environment.
Notebook: https://colab.research.google.com/drive/1hzqvTr_rv9w0nM6jGniCWe05Les9jbKp?usp=sharing
Hi @JunHyungKang,
Thanks for providing the Colab code. We suggest you first go through this tutorial on how to configure an object detection pipeline. I see that you have included a function in retinanet.py to register the experiment. Instead, you can load the experiment configuration with exp_config = exp_factory.get_exp_config('retinanet_resnetfpn_coco'), which loads the configuration required for retinanet_resnetfpn_coco.
Now you can access the experiment configuration like an object, change all the required variables, and use it for training your model with your custom dataset and configuration. Please build and train under distribution_strategy.scope() so that the distribution strategy is taken care of.
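For illustration, a minimal sketch of that flow; the experiment name and attribute paths come from this thread, while the input paths are placeholders:

```python
import tensorflow as tf
from official.core import exp_factory
# Registers the built-in vision experiments; the exact registration import
# may vary across Model Garden versions.
from official.vision import registry_imports  # pylint: disable=unused-import

# Load the registered RetinaNet experiment instead of defining your own.
exp_config = exp_factory.get_exp_config('retinanet_resnetfpn_coco')

# Override only what the custom dataset needs (paths here are placeholders;
# the attribute paths mirror the config dump earlier in this issue).
exp_config.task.train_data.input_path = '/path/to/tfrecords/train*'
exp_config.task.train_data.global_batch_size = 2
exp_config.task.validation_data.input_path = '/path/to/tfrecords/val*'

# Create model variables and run training under the strategy scope.
distribution_strategy = tf.distribute.MirroredStrategy()
with distribution_strategy.scope():
    ...  # build the task/model and run the training loop here
```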
Please check this gist, which shows how you can modify the configuration and make full use of it.
A git clone may include new commits that are not yet in a stable release, so we suggest you install via pip; that way any in-progress errors/bugs will not affect your training. I hope this helps you resolve the issue.
Thanks
Hi @laxmareddyp,
Thank you for your response. However, the notebook you provided is no different from the train.py file that I had already written and run. I am using the stable commit of the master branch, and everything you executed in the notebook runs in the same order in train.py.
I have attached the results of running your notebook in the same environment for reference. As I mentioned earlier, I believe the NMS code may need to be modified to handle cases where there are no predicted bounding boxes, but I haven't been able to look into it more deeply. Please check the attached results.
https://colab.research.google.com/drive/1XJma_dqu4RgWk-dODd03sm5n6WlToxii?usp=sharing
Hi @JunHyungKang,
Yes, it's no different from train.py; I just wanted to write it without creating a function. Now I understand the problem: after prediction, NMS receives an empty list of boxes, while the ConcatV2 op behind tf.concat requires at least two values. We will look into it internally and come back with a proper resolution. Really, thanks for reporting the bug.
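One hedged way such a guard could look, written as a standalone helper rather than as the actual Model Garden patch (the function name and signature are hypothetical):

```python
import tensorflow as tf

def concat_per_class_nms_boxes(nmsed_boxes, batch_size, max_num_detections):
    """Concats per-class NMS results, tolerating the empty-list case.

    Hypothetical sketch, not the official fix: if no foreground class
    produced candidates, return an all-zero result with the expected
    [batch, max_num_detections, 4] shape instead of calling tf.concat
    on an empty list.
    """
    if not nmsed_boxes:
        return tf.zeros([batch_size, max_num_detections, 4], dtype=tf.float32)
    return tf.concat(nmsed_boxes, axis=1)
```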
Thanks.
Hi @JunHyungKang,
Can you please tell me the number of classes declared in the tfrecords? Also, if possible, some dummy tfrecords similar to your data would really help us reproduce the error on our side. I have been trying to reproduce it with another dataset but have not been able to.
Thanks
@laxmareddyp I used one class to generate the tfrecords. Please refer to this dummy.
Hi @JunHyungKang,
Please make sure to follow the requirements below when creating tfrecords; a minimal writer sketch follows the table.
| Key | Type | Description |
|---|---|---|
| 'image/encoded' | bytes | Required. The encoded image bytes. |
| 'image/source_id' | string | Required. The unique identifier of the image; needs to be a number in string form. |
| 'image/height' | integer | Optional. The height of the image. If not present, inferred from the image. |
| 'image/width' | integer | Optional. The width of the image. If not present, inferred from the image. |
| 'image/object/bbox/xmin' | list of float | Required. The normalized xmin coordinates of all the instances. |
| 'image/object/bbox/xmax' | list of float | Required. The normalized xmax coordinates of all the instances. |
| 'image/object/bbox/ymin' | list of float | Required. The normalized ymin coordinates of all the instances. |
| 'image/object/bbox/ymax' | list of float | Required. The normalized ymax coordinates of all the instances. |
| 'image/object/class/label' | list of integer | Required. The class indices of all the instances. Note that 0 is reserved for background. |
| 'image/object/mask' | list of bytes | Optional. The masks of all the instances in PNG format. |
| 'image/object/area' | list of float | Optional. The areas of all the instances. If not present, derived from the bounding boxes. |
| 'image/object/is_crowd' | list of integer | Optional. 0/1 integers denoting whether instances are a crowd. Crowd instances get special treatment during evaluation. 0 (not crowd) by default. |
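To make the expected format concrete, here is a minimal writer sketch covering the required keys from the table above (file name, image size, and box coordinates are placeholders):

```python
import tensorflow as tf

def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
def _floats(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
def _ints(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

# One image with a single box of class 1 (0 is reserved for background).
example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': _bytes([open('image.jpg', 'rb').read()]),
    'image/source_id': _bytes([b'12345']),  # numeric string, no alphabetic characters
    'image/height': _ints([1024]),
    'image/width': _ints([1024]),
    'image/object/bbox/xmin': _floats([0.1]),
    'image/object/bbox/xmax': _floats([0.5]),
    'image/object/bbox/ymin': _floats([0.2]),
    'image/object/bbox/ymax': _floats([0.6]),
    'image/object/class/label': _ints([1]),
}))

with tf.io.TFRecordWriter('train-00000-of-00001.tfrecord') as writer:
    writer.write(example.SerializeToString())
```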
The dummy records you provided raise an error during evaluation because the source_id contains alphabetic characters; it needs to be a numeric string. Please find the gist in which I am trying to debug this.
Also, set the number of classes to 2, because class indices generally start from 0 in the COCO JSON, and 0 can be considered the background if I am not wrong. Please check the screenshot below from the BCCD dataset.
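In configuration terms this amounts to the following hedged sketch, where exp_config is the object loaded via exp_factory.get_exp_config as discussed above:

```python
# Index 0 is reserved for background, so a dataset with one real class needs:
exp_config.task.model.num_classes = 2  # background + 1 foreground class
# and the values stored in 'image/object/class/label' should then be 1.
```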
Please go through this object detection tutorial if you have not already; it uses the BCCD dataset to train the model. If the error still persists once your tfrecords are in the proper format, we are happy to debug further and help you resolve the issue.
Thanks
This issue has been marked stale because it has had no recent activity in the past 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/official/vision/modeling/layers/detection_generator.py
2. Describe the bug
In the function '_generate_detections_v2_class_aware', the following error occurs during the evaluation phase: ValueError: List argument 'values' to 'ConcatV2' Op with length 0 shorter than minimum length 2.
3. Steps to reproduce
Pass input with length-0 tensors as the predicted boxes.
4. Expected behavior
The function should be skipped, or handle the case gracefully, when there are 0 or 1 predicted boxes.
5. Additional context
N/A
6. System information