turowicz opened this issue 4 years ago
@vighneshbirodkar this is the problem
Can you inspect the contents of your model dir (the argument passed as --model_dir) and paste them here, along with the contents of the checkpoint file? It would also be helpful to see the entire config file you are using.
@vighneshbirodkar
I downloaded the model from the following URL and didn't change anything: http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
I'm running on an Nvidia V100 GPU.
My config is:
# SSD with EfficientNet-b1 + BiFPN feature extractor,
# shared box predictor and focal loss (a.k.a EfficientDet-d1).
# See EfficientDet, Tan et al, https://arxiv.org/abs/1911.09070
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from an EfficientNet-b1 checkpoint.
#
# Train on TPU-8
model {
  ssd {
    num_classes: 7
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 640
        max_dimension: 640
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: "ssd_efficientnet-b1_bifpn_keras"
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: SWISH
        batch_norm {
          decay: 0.9900000095367432
          scale: true
          epsilon: 0.0010000000474974513
        }
        force_use_bias: true
      }
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 4
        num_filters: 88
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 1.0
        x_scale: 1.0
        height_scale: 1.0
        width_scale: 1.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: SWISH
          batch_norm {
            decay: 0.9900000095367432
            scale: true
            epsilon: 0.0010000000474974513
          }
          force_use_bias: true
        }
        depth: 88
        num_layers_before_predictor: 3
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 3
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 1.5
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    add_background_class: false
  }
}
train_config {
  batch_size: 8
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 640
      scale_min: 0.10000000149011612
      scale_max: 2.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 300000
          warmup_learning_rate: 0.0010000000474974513
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-model/checkpoint/ckpt-0"
  num_steps: 300000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  use_bfloat16: true
  fine_tune_checkpoint_version: V2
}
train_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}
eval_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
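To double-check that this file parses cleanly I load it with the Object Detection API's config_util, roughly like this (just a sketch; the pipeline.config path is a placeholder for wherever I keep the file):

from object_detection.utils import config_util

# Parse the pipeline config shown above and echo the training settings that
# matter for resuming: the checkpoint it fine-tunes from, its type and version.
configs = config_util.get_configs_from_pipeline_file("models/efficientdet_d1/pipeline.config")
train_config = configs["train_config"]

print(train_config.fine_tune_checkpoint)          # pre-trained-model/checkpoint/ckpt-0
print(train_config.fine_tune_checkpoint_type)     # detection
print(train_config.fine_tune_checkpoint_version)  # enum value for V2
print(train_config.num_steps)                     # 300000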
If you are loading from a pre-trained checkpoint, these warnings are expected. According to the code here, we only load the weights for _feature_extractor. We do not load weights for _box_predictor, because that lets you change the box prediction parameters according to your application. TensorFlow is just warning us that certain weights in the checkpoint are not used.
That said, in spite of these warnings, training should be able to resume correctly. Can you attach the full logs of your training runs? I would need to see two logs: the first one in which you train from scratch, and the second one in which you resume the training job.
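You can see that split yourself by listing the variables stored in the checkpoint and grouping them by object path, roughly like this (a sketch; the path is the fine_tune_checkpoint from your config):

import collections
import tensorflow as tf

ckpt_path = "pre-trained-model/checkpoint/ckpt-0"

# Object-based (TF2) checkpoints store variables under object paths such as
# model/_feature_extractor/... and model/_box_predictor/...; count each prefix.
counts = collections.Counter()
for name, shape in tf.train.list_variables(ckpt_path):
    prefix = "/".join(name.split("/")[:2])
    counts[prefix] += 1

for prefix, n in counts.most_common():
    print(f"{n:4d} variables under {prefix}")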
@vighneshbirodkar
Full logs attached: run-1.log run-2.log
Below you can see how run number 2 doesn't pick up the loss but starts from scratch. The step numbers do carry on, though.
Run 1 Loss:
INFO:tensorflow:Step 100 per-step time 0.767s loss=0.832
I0911 16:12:03.421945 140510505338688 model_lib_v2.py:652] Step 100 per-step time 0.767s loss=0.832
INFO:tensorflow:Step 200 per-step time 0.836s loss=0.557
I0911 16:13:23.431074 140510505338688 model_lib_v2.py:652] Step 200 per-step time 0.836s loss=0.557
INFO:tensorflow:Step 300 per-step time 0.855s loss=0.635
I0911 16:14:43.690119 140510505338688 model_lib_v2.py:652] Step 300 per-step time 0.855s loss=0.635
INFO:tensorflow:Step 400 per-step time 0.803s loss=0.745
I0911 16:16:03.250435 140510505338688 model_lib_v2.py:652] Step 400 per-step time 0.803s loss=0.745
INFO:tensorflow:Step 500 per-step time 0.794s loss=0.560
I0911 16:17:22.130386 140510505338688 model_lib_v2.py:652] Step 500 per-step time 0.794s loss=0.560
INFO:tensorflow:Step 600 per-step time 0.766s loss=0.475
I0911 16:18:41.020164 140510505338688 model_lib_v2.py:652] Step 600 per-step time 0.766s loss=0.475
INFO:tensorflow:Step 700 per-step time 0.830s loss=0.553
I0911 16:20:00.191405 140510505338688 model_lib_v2.py:652] Step 700 per-step time 0.830s loss=0.553
INFO:tensorflow:Step 800 per-step time 0.751s loss=0.389
I0911 16:21:19.443584 140510505338688 model_lib_v2.py:652] Step 800 per-step time 0.751s loss=0.389
INFO:tensorflow:Step 900 per-step time 0.793s loss=0.411
I0911 16:22:38.986910 140510505338688 model_lib_v2.py:652] Step 900 per-step time 0.793s loss=0.411
INFO:tensorflow:Step 1000 per-step time 0.837s loss=0.455
I0911 16:23:57.950603 140510505338688 model_lib_v2.py:652] Step 1000 per-step time 0.837s loss=0.455
^C
Run 2 Loss:
INFO:tensorflow:Step 1100 per-step time 0.794s loss=0.805
I0911 16:28:03.628168 139671625799488 model_lib_v2.py:652] Step 1100 per-step time 0.794s loss=0.805
INFO:tensorflow:Step 1200 per-step time 0.783s loss=0.633
I0911 16:29:23.007995 139671625799488 model_lib_v2.py:652] Step 1200 per-step time 0.783s loss=0.633
INFO:tensorflow:Step 1300 per-step time 0.785s loss=0.781
I0911 16:30:42.642542 139671625799488 model_lib_v2.py:652] Step 1300 per-step time 0.785s loss=0.781
INFO:tensorflow:Step 1400 per-step time 0.785s loss=0.705
I0911 16:32:02.760208 139671625799488 model_lib_v2.py:652] Step 1400 per-step time 0.785s loss=0.705
INFO:tensorflow:Step 1500 per-step time 0.762s loss=0.548
I0911 16:33:21.996925 139671625799488 model_lib_v2.py:652] Step 1500 per-step time 0.762s loss=0.548
^C
I understand the issue now. I will do a bit more investigating.
@vighneshbirodkar @saikumarchalla @gowthamkpr any ideas?
I also trained EfficientDet D4 with pre-trained weights on my own dataset. Before restarting the training, I changed the fine_tune_checkpoint value in the pipeline config file to the path of an already-trained checkpoint. This is the repo I am using: https://github.com/jahongir7174/EfficientDet. Sorry if I have misunderstood.
@jahongir7174 That will probably work, but in TF 1.x the newly created checkpoints were picked up automatically.
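As a workaround I would have to look up the newest checkpoint in my model dir before every restart and paste that path into fine_tune_checkpoint by hand, something like this sketch (model_dir is a placeholder for whatever is passed as --model_dir):

import tensorflow as tf

model_dir = "training/"                             # placeholder --model_dir
pretrained = "pre-trained-model/checkpoint/ckpt-0"  # the zoo checkpoint from the config

# Use my own latest checkpoint if one exists, otherwise fall back to the zoo one;
# the printed path is what fine_tune_checkpoint would have to be set to manually.
latest = tf.train.latest_checkpoint(model_dir)
print("fine_tune_checkpoint:", latest or pretrained)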
I was thinking of opening a similar issue, but thank God I didn't. I also trained an EfficientDet D0 model and stopped after reaching a loss of 0.6. I then decided to train it further, so I edited the config file to point to the latest checkpoint file. However, when using that checkpoint for training, the model seems to forget everything and starts re-learning; it takes about the same time and number of steps as before to converge back to 0.6.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py
2. Describe the bug
Contrary to TF 1.x, in TF 2.x when I stop training after a checkpoint, run evaluation, and restart training, the model starts learning from scratch.
3. Steps to reproduce
The same thing happens if you skip step 2 (running evaluation).
4. Expected behavior
I expect the training to continue from where it left off.
5. Additional context
EfficientDet D1 from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
Default config and checkpoints.
6. System information
Checkpoint loading errors: