Closed anshkumar closed 3 years ago
This is strange. Can you 1) tell us which TF version you are using, and 2) clear model_dir and try again?
1) TF version is 2.4.1.
2) After clearing model_dir, it is working. But when specifying fine_tune_checkpoint_type as "detection" or "fine_tune", it's failing.
Can you cleanup the directory, run with "fine_tune" and tell us what error you are getting ?
After clearing the model_dir (only having ckpt-0.data-00000-of-00001 and ckpt-0.index), the "fine_tune" option is working fine. But when using "detection" I'm getting a long list of errors.
model_dir should be completely empty before starting the training job for the first time. And fine tune checkpoint should be stored in a different directory which is not model_dir.
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_training_and_evaluation.md#recommended-directory-structure-for-training-and-evaluation The page above explains this. Note that everything in model dir is created by the training/evaluation jobs.
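For concreteness, here is a minimal sketch (plain Python, not part of the OD API) of the layout check described above; both paths are placeholders:
import os

model_dir = "/path/to/model_dir"            # passed to model_main_tf2.py as --model_dir; must start out empty
fine_tune_dir = "/path/to/pretrained_ckpt"  # holds ckpt-0.data-* and ckpt-0.index, outside model_dir

assert not os.path.isdir(model_dir) or not os.listdir(model_dir), (
    "model_dir must be completely empty before the first training run")
assert os.path.isfile(os.path.join(fine_tune_dir, "ckpt-0.index")), (
    "the fine-tune checkpoint should live in its own directory, not in model_dir")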
Even after doing that I'm getting the following error for "detection":
Traceback (most recent call last):
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 110, in main
record_summaries=FLAGS.record_summaries)
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 597, in train_loop
train_input, unpad_groundtruth_tensors)
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 398, in load_fine_tune_checkpoint
ckpt.restore(checkpoint_path).assert_existing_objects_matched()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 810, in assert_existing_objects_matched
(list(unused_python_objects),))
AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [MirroredVariable:{
0: <tf.Variable 'center_net_hourglass_feature_extractor/hourglass_network/encoder_decoder_block_5/encoder_decoder_block_6/encoder_decoder_block_7/encoder_decoder_block_8/encoder_decoder_block_9/residual_block_58/convolutional_block_60/batchnorm/gamma:0' shape=(512,) dtype=float32, numpy=
...
Here is the complete error.
"detection" is not designed to work with this use case. "detection" with Centernet is only currently supported from the extreme net checkpoint in the model zoo.
Since "fine_tune" is working, I am closing this bug because that is the intended behavior.
Please note that this commit https://github.com/tensorflow/models/commit/aa3e639f80c2967504310b0f578f0f00063a8aff consolidates the "fine_tune" and "detection" types into just "detection". All TF2 models now support only 3 types: "detection", "classification" and "full".
I tried using "detection" again with the latest pull, but with the pre-trained checkpoints I'm getting the following error:
Traceback (most recent call last):
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 110, in main
record_summaries=FLAGS.record_summaries)
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 598, in train_loop
train_input, unpad_groundtruth_tensors)
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 400, in load_fine_tune_checkpoint
ckpt.restore(checkpoint_path).assert_existing_objects_matched()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1776, in restore
status = self._saver.restore(save_path=save_path)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1339, in restore
checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py", line 258, in restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py", line 978, in _restore_from_checkpoint_position
tensor_saveables, python_saveables))
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 309, in restore_saveables
validated_saveables).restore(self.save_path_tensor, self.options)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py", line 339, in restore
restore_ops = restore_fn()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py", line 323, in restore_fn
restore_ops.update(saver.restore(file_prefix, options))
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py", line 116, in restore
restored_tensors, restored_shapes=None)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 1079, in restore
tensor)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/distribute/values_util.py", line 96, in get_on_write_restore_ops
for v in var.values))
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/distribute/values_util.py", line 96, in <genexpr>
for v in var.values))
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/distribute/values_util.py", line 302, in assign_on_device
return variable.assign(tensor)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 901, in assign
(tensor_name, self._shape, value_tensor.shape))
ValueError: Cannot assign to variable center_net_hourglass_feature_extractor/hourglass_network/input_downsample_block/convolutional_block/conv2d/kernel:0 due to variable shape (7, 7, 3, 64) and value shape (7, 7, 3, 128) are incompatible
Will the pre-trained checkpoint only work with 1024x1024 input and the "Hourglass-100" mask head?
Can you tell us which checkpoint you are trying this with? And also share the full stack trace of the error log and the config file.
@anshkumar Pre-training should work with all mask heads.
I'm using the checkpoints provided in the document here. Here is my config:
# DeepMAC meta architecture from the "The surprising impact of mask-head
# architecture on novel class segmentation" [1] paper with an Hourglass-100[2]
# mask head. This config is trained on all COCO classes and achieves a
# mask mAP of 39.4% on the COCO testdev-2017 set.
# [1]: https://arxiv.org/abs/2104.00613
# [2]: https://arxiv.org/abs/1904.07850
# Train on TPU-128
model {
  center_net {
    num_classes: 5
    feature_extractor {
      type: "hourglass_52"
      bgr_ordering: true
      channel_means: [104.01362025, 114.03422265, 119.9165958 ]
      channel_stds: [73.6027665 , 69.89082075, 70.9150767 ]
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 768
        max_dimension: 768
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.1
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      min_box_overlap_iou: 0.7
      max_box_predictions: 2000
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
    }
    deepmac_mask_estimation {
      dim: 32
      task_loss_weight: 5.0
      pixel_embedding_dim: 16
      mask_size: 32
      use_xy: true
      use_instance_embedding: true
      network_type: "hourglass20"
      classification_loss {
        weighted_sigmoid {}
      }
    }
  }
}
train_config: {
  batch_size: 4
  num_steps: 50000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_adjust_hue {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    random_adjust_saturation {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  #data_augmentation_options {
  #  random_square_crop_by_scale {
  #    scale_min: 0.6
  #    scale_max: 1.3
  #  }
  #}
  optimizer {
    adam_optimizer: {
      epsilon: 1e-7  # Match tf.keras.optimizers.Adam's default.
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 1e-3
          total_steps: 50000
          warmup_learning_rate: 2.5e-4
          warmup_steps: 5000
        }
      }
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "/home/deploy/ved/deepmac_1024x1024_coco17/pre-train/ckpt-0"
  fine_tune_checkpoint_type: "detection"
}
train_input_reader: {
  load_instance_masks: true
  label_map_path: "/home/deploy/ved/pfg/l2/sort/label_map_potato_l2.pbtxt"
  mask_type: PNG_MASKS
  tf_record_input_reader {
    input_path: "/home/deploy/ved/pfg/l2/sort/sort_train.record"
  }
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  metrics_set: "coco_mask_metrics"
  include_metrics_per_category: true
  use_moving_averages: false
  batch_size: 1;
}
eval_input_reader: {
  load_instance_masks: true
  mask_type: PNG_MASKS
  label_map_path: "/home/deploy/ved/pfg/l2/sort/label_map_potato_l2.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/home/deploy/ved/pfg/l2/sort/sort_val.record"
  }
}
Here is the full log.
Also, with this config I'm not able to get any mask loss in TensorBoard (when trained without any pre-trained checkpoint).
This is happening because you are using the hourglass52 feature extractor. Hourglass-52 uses 64 channels in its first layers (https://github.com/tensorflow/models/blob/master/research/object_detection/models/center_net_hourglass_feature_extractor.py#L100), whereas hourglass104 uses 128 channels (https://github.com/tensorflow/models/blob/master/research/object_detection/models/center_net_hourglass_feature_extractor.py#L106).
We don't support altering the number of channels right now. My suggestion would be to try and use the hourglass104 feature extractor.
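As a quick way to verify this, here is a small sketch using the standard TF checkpoint reader (the path is the one from the config above); for this checkpoint it should report a 7x7x3 stem kernel with 128 output channels, which is exactly the shape that cannot be restored into hourglass52:
import tensorflow as tf

# List every 7x7x3 convolution kernel stored in the pre-trained checkpoint.
reader = tf.train.load_checkpoint(
    "/home/deploy/ved/deepmac_1024x1024_coco17/pre-train/ckpt-0")
for name, shape in reader.get_variable_to_shape_map().items():
    if len(shape) == 4 and shape[:3] == [7, 7, 3]:
        print(name, shape)  # expected: [7, 7, 3, 128] for the stem conv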
Thanks for the clarification. But why am I not getting a mask loss? Also, during validation it was showing a KeyError for "detection_masks". Here is a temporary tensorboard.
Oh, that might be a bug. Let me investigate.
@anshkumar This commit should fix the issue. https://github.com/tensorflow/models/commit/8b45de4ffc7eb8d66f0139ee1f62e699ee401072
@vighneshbirodkar it's missing an import:
from object_detection.meta_architectures import deepmac_meta_arch
Fixed via https://github.com/tensorflow/models/commit/441f14a6aac221406aeb98c96df3ef3d0c3752f9
I also added a test.
@vighneshbirodkar during validation, I'm getting the following error:
Traceback (most recent call last):
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/deploy/miniconda3/envs/tensorflow/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/deploy/models/research/object_detection/model_main_tf2.py", line 88, in main
wait_interval=300, timeout=FLAGS.eval_timeout)
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 1139, in eval_continuously
global_step=global_step,
File "/home/deploy/models/research/object_detection/model_lib_v2.py", line 984, in eager_eval_loop
eval_metrics.update(evaluator.evaluate())
File "/home/deploy/models/research/object_detection/metrics/coco_evaluation.py", line 307, in evaluate
super_categories=self._super_categories)
File "/home/deploy/models/research/object_detection/metrics/coco_tools.py", line 305, in ComputeMetrics
raise ValueError('Category stats do not exist')
ValueError: Category stats do not exist
I'm training on a custom dataset using DeepMAC. The config is as follows:
But I'm getting the following error: