tensorflow / models

Models and examples built with TensorFlow

Stopping and restarting the training script starts from scratch #9229

Open turowicz opened 4 years ago

turowicz commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py

2. Describe the bug

Contrary to TF 1.x, in TF 2.x when I stop training after a checkpoint has been written, run evaluation, and then restart training, the model starts learning from scratch.


3. Steps to reproduce

  1. Run training and produce a checkpoint.
  2. Stop training and run evaluation.
  3. Restart training.

The same thing happens when you skip step 2.

4. Expected behavior

I expect the training to continue from where it left off.

5. Additional context

EfficientDet D1 from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md

Default config and checkpoints.

6. System information

Checkpoint loading errors:

...
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.4.moving_variance
W0911 13:07:41.965147 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.4.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.axis
W0911 13:07:41.965251 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.gamma
W0911 13:07:41.965382 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.beta
W0911 13:07:41.965474 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_mean
W0911 13:07:41.965547 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
W0911 13:07:41.965606 140196924864320 util.py:150] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0911 13:07:41.965777 140196924864320 util.py:158] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
...
turowicz commented 4 years ago

@vighneshbirodkar this is the problem

vighneshbirodkar commented 4 years ago

Can you inspect the contents of your model dir (the argument passed as --model_dir) and paste them here, along with the content of the checkpoint file? It would also be helpful to see the entire config file you are using.
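
For example, something along these lines (just a sketch; "training/" is a placeholder for whatever you pass as --model_dir):

import os
import tensorflow as tf

model_dir = "training/"  # placeholder for the value passed as --model_dir

# Files actually present in the model dir: checkpoint index/data shards,
# the "checkpoint" state file, event files, etc.
print(sorted(os.listdir(model_dir)))

# The checkpoint prefix TF2 considers the latest one in model_dir.
print(tf.train.latest_checkpoint(model_dir))

# Variable names and shapes stored in that latest checkpoint.
for name, shape in tf.train.list_variables(model_dir):
    print(name, shape)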

turowicz commented 4 years ago

@vighneshbirodkar

I download the model from the following URL and don't change anything: http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz

I'm running on an Nvidia V100 GPU.

My config is:

# SSD with EfficientNet-b1 + BiFPN feature extractor,
# shared box predictor and focal loss (a.k.a EfficientDet-d1).
# See EfficientDet, Tan et al, https://arxiv.org/abs/1911.09070
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from an EfficientNet-b1 checkpoint.
#
# Train on TPU-8

model {
  ssd {
    num_classes: 7
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 640
        max_dimension: 640
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: "ssd_efficientnet-b1_bifpn_keras"
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: SWISH
        batch_norm {
          decay: 0.9900000095367432
          scale: true
          epsilon: 0.0010000000474974513
        }
        force_use_bias: true
      }
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 4
        num_filters: 88
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 1.0
        x_scale: 1.0
        height_scale: 1.0
        width_scale: 1.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: SWISH
          batch_norm {
            decay: 0.9900000095367432
            scale: true
            epsilon: 0.0010000000474974513
          }
          force_use_bias: true
        }
        depth: 88
        num_layers_before_predictor: 3
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 3
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 1.5
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    add_background_class: false
  }
}
train_config {
  batch_size: 8
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 640
      scale_min: 0.10000000149011612
      scale_max: 2.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 300000
          warmup_learning_rate: 0.0010000000474974513
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-model/checkpoint/ckpt-0"
  num_steps: 300000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  use_bfloat16: true
  fine_tune_checkpoint_version: V2
}
train_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
vighneshbirodkar commented 4 years ago

If you are loading from a pre-trained checkpoint, these warnings are expected. According to the code here, we only load the weights for _feature_extractor; we do not load weights for _box_predictor, because that lets you change the box prediction parameters to suit your application. TensorFlow is simply warning us that certain weights in the checkpoint are not used.
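
As a self-contained illustration of that mechanism (a toy sketch only, not the actual object_detection restore code; the names feature_extractor and box_predictor are stand-ins):

import os
import tensorflow as tf

os.makedirs("/tmp/toy_ckpt", exist_ok=True)

# Save a checkpoint holding two branches...
feature_extractor = tf.Variable([1.0, 2.0])
box_predictor = tf.Variable([3.0, 4.0])
full = tf.train.Checkpoint(feature_extractor=feature_extractor,
                           box_predictor=box_predictor)
path = full.save("/tmp/toy_ckpt/ckpt")

# ...then restore only one of them. The box_predictor values left over in the
# checkpoint are reported as "Unresolved object in checkpoint";
# expect_partial() marks the partial restore as intentional.
partial = tf.train.Checkpoint(feature_extractor=tf.Variable([0.0, 0.0]))
partial.restore(path).expect_partial()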

With that said, in spite of these warnings, training should be able to resume correctly. Can you attach the full logs of your training run? I would need to see two logs: the first from the run where you train from scratch, and the second from the run where you resume the training job.
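
For context, resuming in TF2 normally relies on a tf.train.CheckpointManager over --model_dir, roughly as in the minimal sketch below (toy objects and an assumed "training/" directory, not the actual model_lib_v2 code):

import tensorflow as tf

# Toy stand-ins for the detection model, its optimizer and the step counter.
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.SGD(momentum=0.9)
global_step = tf.Variable(0, dtype=tf.int64)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=global_step)
manager = tf.train.CheckpointManager(ckpt, directory="training/", max_to_keep=5)

# On (re)start: if model_dir already holds checkpoints, restoring the latest
# one should bring back the weights, optimizer slots and the step counter,
# so the loss should continue where it left off rather than reset.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)
    print("Resumed from", manager.latest_checkpoint, "at step", global_step.numpy())
else:
    print("No checkpoint found in model_dir, starting fresh")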

turowicz commented 4 years ago

@vighneshbirodkar

Full logs attached: run-1.log run-2.log

Below you can see how run number 2 doesn't pick up the loss but starts from scratch. The step numbers are in line, though.

Run 1 Loss:

INFO:tensorflow:Step 100 per-step time 0.767s loss=0.832
I0911 16:12:03.421945 140510505338688 model_lib_v2.py:652] Step 100 per-step time 0.767s loss=0.832
INFO:tensorflow:Step 200 per-step time 0.836s loss=0.557
I0911 16:13:23.431074 140510505338688 model_lib_v2.py:652] Step 200 per-step time 0.836s loss=0.557
INFO:tensorflow:Step 300 per-step time 0.855s loss=0.635
I0911 16:14:43.690119 140510505338688 model_lib_v2.py:652] Step 300 per-step time 0.855s loss=0.635
INFO:tensorflow:Step 400 per-step time 0.803s loss=0.745
I0911 16:16:03.250435 140510505338688 model_lib_v2.py:652] Step 400 per-step time 0.803s loss=0.745
INFO:tensorflow:Step 500 per-step time 0.794s loss=0.560
I0911 16:17:22.130386 140510505338688 model_lib_v2.py:652] Step 500 per-step time 0.794s loss=0.560
INFO:tensorflow:Step 600 per-step time 0.766s loss=0.475
I0911 16:18:41.020164 140510505338688 model_lib_v2.py:652] Step 600 per-step time 0.766s loss=0.475
INFO:tensorflow:Step 700 per-step time 0.830s loss=0.553
I0911 16:20:00.191405 140510505338688 model_lib_v2.py:652] Step 700 per-step time 0.830s loss=0.553
INFO:tensorflow:Step 800 per-step time 0.751s loss=0.389
I0911 16:21:19.443584 140510505338688 model_lib_v2.py:652] Step 800 per-step time 0.751s loss=0.389
INFO:tensorflow:Step 900 per-step time 0.793s loss=0.411
I0911 16:22:38.986910 140510505338688 model_lib_v2.py:652] Step 900 per-step time 0.793s loss=0.411
INFO:tensorflow:Step 1000 per-step time 0.837s loss=0.455
I0911 16:23:57.950603 140510505338688 model_lib_v2.py:652] Step 1000 per-step time 0.837s loss=0.455
^C

Run 2 Loss:

INFO:tensorflow:Step 1100 per-step time 0.794s loss=0.805
I0911 16:28:03.628168 139671625799488 model_lib_v2.py:652] Step 1100 per-step time 0.794s loss=0.805
INFO:tensorflow:Step 1200 per-step time 0.783s loss=0.633
I0911 16:29:23.007995 139671625799488 model_lib_v2.py:652] Step 1200 per-step time 0.783s loss=0.633
INFO:tensorflow:Step 1300 per-step time 0.785s loss=0.781
I0911 16:30:42.642542 139671625799488 model_lib_v2.py:652] Step 1300 per-step time 0.785s loss=0.781
INFO:tensorflow:Step 1400 per-step time 0.785s loss=0.705
I0911 16:32:02.760208 139671625799488 model_lib_v2.py:652] Step 1400 per-step time 0.785s loss=0.705
INFO:tensorflow:Step 1500 per-step time 0.762s loss=0.548
I0911 16:33:21.996925 139671625799488 model_lib_v2.py:652] Step 1500 per-step time 0.762s loss=0.548
^C
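
One way to check what run 2 actually restored is to compare a few tensors between the pre-trained fine_tune_checkpoint and the latest checkpoint in --model_dir (a hedged sketch; "training/" is a placeholder for my model dir):

import numpy as np
import tensorflow as tf

fine_tune = "pre-trained-model/checkpoint/ckpt-0"   # from the config above
resumed = tf.train.latest_checkpoint("training/")   # placeholder --model_dir

r1 = tf.train.load_checkpoint(fine_tune)
r2 = tf.train.load_checkpoint(resumed)

# The two checkpoints may track different object graphs, so only compare
# variable keys that appear in both, and skip non-numeric entries.
common = sorted(set(r1.get_variable_to_shape_map()) &
                set(r2.get_variable_to_shape_map()))
for name in common[:20]:
    a = np.asarray(r1.get_tensor(name))
    b = np.asarray(r2.get_tensor(name))
    if a.shape == b.shape and np.issubdtype(a.dtype, np.number):
        print(name, "identical" if np.allclose(a, b) else "changed")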
vighneshbirodkar commented 4 years ago

I understand the issue now. I will do a bit more investigating.

turowicz commented 4 years ago

@vighneshbirodkar @saikumarchalla @gowthamkpr any ideas?

ghost commented 4 years ago

I also trained EfficientDet D4 with pre-trained weights on my own dataset. Before restarting the training, I changed the fine_tune_checkpoint value in the pipeline config file to the path of an already-trained checkpoint. This is the repo I am using: https://github.com/jahongir7174/EfficientDet. Sorry if I have misunderstood.
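
Programmatically, that edit could look roughly like this (a hedged sketch using object_detection's config utilities as I understand them; both paths are placeholders):

import tensorflow as tf
from object_detection.utils import config_util

pipeline_path = "pre-trained-model/pipeline.config"  # placeholder path
model_dir = "training/"                              # placeholder --model_dir

configs = config_util.get_configs_from_pipeline_file(pipeline_path)

# Point fine-tuning at our own latest checkpoint instead of the zoo's ckpt-0.
configs["train_config"].fine_tune_checkpoint = tf.train.latest_checkpoint(model_dir)

pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, model_dir)  # writes model_dir/pipeline.config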

turowicz commented 4 years ago

@jahongir7174 That will probably work, but in TF 1.x the newly created checkpoints were picked up automatically.

rahmanabidchy commented 4 years ago

I was thinking of opening a similar issue, but thankfully I didn't. I also trained an EfficientDet D0 model and stopped after reaching a loss of 0.6. I then decided to train it further, so I edited the config file to point to the latest checkpoint file. However, when training from that checkpoint the model seems to forget everything and starts re-learning; it takes about the same time and number of steps as before to converge to 0.6.