tensorflow / models

Models and examples built with TensorFlow

Errors while loading ssd_mobilenet_v2_mnasfpn_coco checkpoint #8581

Open Apollo-XI opened 4 years ago

Apollo-XI commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

Model: http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_sync_2020_05_18.tar.gz

2. Describe the bug

When I load the ssd_mobilenet_v2_mnasfpn_coco checkpoint to train on the Oxford-IIIT Pets dataset, TF throws several errors, although training starts. However, the model doesn't learn anything and the error sometimes goes to NaN, as reported in #8549. The same errors appear when the train script loads the model to perform evaluation.

3. Steps to reproduce

Fine-tune on the Oxford-IIIT Pets dataset, changing the COCO config to adapt it to the new dataset:

Code to reproduce the issue:

4. Expected behavior

Model loads without errors and learns something.

5. Additional context

Install TF Object Detection: https://github.com/tensorflow/models/tree/master/research/object_detection

Message logs: log.txt
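
One way to narrow down load errors like these is to compare the variable names and shapes stored in the checkpoint against those the new model expects. The sketch below is only an illustration: `diff_variable_shapes` is a hypothetical helper, and the variable names and shapes in the usage example are made up rather than taken from the attached log. With TensorFlow installed, the checkpoint side could be filled from `tf.train.list_variables`, which returns `(name, shape)` pairs.

```python
def diff_variable_shapes(checkpoint_shapes, model_shapes):
    """Compare two {variable_name: shape} mappings.

    Returns (missing, mismatched):
      missing    - sorted names the model expects but the checkpoint lacks
      mismatched - {name: (checkpoint_shape, model_shape)} for differing shapes
    """
    missing = sorted(n for n in model_shapes if n not in checkpoint_shapes)
    mismatched = {
        n: (tuple(checkpoint_shapes[n]), tuple(model_shapes[n]))
        for n in model_shapes
        if n in checkpoint_shapes
        and tuple(checkpoint_shapes[n]) != tuple(model_shapes[n])
    }
    return missing, mismatched


# Assumption: with TensorFlow available, the checkpoint side could come from
#   checkpoint_shapes = dict(tf.train.list_variables("/path/to/model.ckpt"))
# The names and shapes below are toy values for illustration only.
toy_checkpoint = {
    "toy/backbone/weights": [3, 3, 16],
    "toy/class_head/weights": [64, 10],
}
toy_model = {
    "toy/backbone/weights": [3, 3, 16],
    "toy/class_head/weights": [64, 4],
    "toy/extra/bias": [4],
}

missing, mismatched = diff_variable_shapes(toy_checkpoint, toy_model)
print("missing from checkpoint:", missing)
print("shape mismatches:", mismatched)
```

Variables that depend on `num_classes` would be expected to differ after switching datasets; mismatches elsewhere would be the more interesting ones to look at.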

6. System information

Pipeline configuration:

# SSD with MnasFPN feature extractor, shared box predictor
# See Chen et al, https://arxiv.org/abs/1912.01106
# Trained on COCO, initialized from scratch.
#
# 0.92B MulAdds, 2.5M Parameters. Latency is 193ms on Pixel 1.
# Achieves 26.6 mAP on COCO14 minival dataset.

# This config is TPU compatible

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 37
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 6
        anchor_scale: 3.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 3
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 64
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        share_prediction_tower: true
        use_depthwise: true
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_mnasfpn'
      fpn {
        min_level: 3
        max_level: 6
        additional_layer_depth: 48
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          random_normal_initializer {
            stddev: 0.01
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 32
  num_steps: 50000

  fine_tune_checkpoint: "/content/data/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_sync_2020_05_18/model.ckpt"
  fine_tune_checkpoint_type: "detection"

  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 4.
          total_steps: 50000
          warmup_learning_rate: .026666
          warmup_steps: 5000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/content/data/tf_record/pet_faces_train.record-?????-of-00010"
  }
  label_map_path: "/content/models/research/object_detection/data/pet_label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 1104
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/content/data/tf_record/pet_faces_val.record-?????-of-00010"
  }
  label_map_path: "/content/models/research/object_detection/data/pet_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

tombstone commented 4 years ago

@Apollo-XI Learning rate of 4.0 seems too high. I recommend trying to lower it by 1-2 orders of magnitude.

Apollo-XI commented 4 years ago

@tombstone I tried. I used 4.0 because the ssd_mv2_mnas_fpn config uses it to train on the COCO dataset. Since I saw gradients exploding before the error went NaN, I lowered the LR to 0.4 and 0.04. At 0.4, gradients still exploded. At 0.04, I could train longer but the net didn't learn anything :/
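
To make the learning-rate discussion concrete, here is a plain-Python sketch of a cosine-decay schedule with linear warmup, evaluated with the values from the config above. It assumes the schedule ramps linearly from `warmup_learning_rate` to `learning_rate_base` over `warmup_steps` and then follows a half cosine down to zero at `total_steps`; the Object Detection API's own implementation may differ in details, so treat this as an approximation rather than the library's code.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base, total_steps,
                             warmup_learning_rate=0.0, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    if step >= total_steps:
        return 0.0
    # Fraction of the post-warmup phase that has elapsed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


# Values from the pipeline config in this issue.
for step in (0, 1000, 5000, 27500, 50000):
    lr = cosine_decay_with_warmup(
        step,
        learning_rate_base=4.0,
        total_steps=50000,
        warmup_learning_rate=0.026666,
        warmup_steps=5000,
    )
    print(f"step {step:>6}: lr = {lr:.6f}")
```

Under these assumptions the rate reaches its peak of `learning_rate_base` exactly when warmup ends, so lowering `learning_rate_base` lowers the whole curve, while `warmup_learning_rate` only sets where the ramp starts.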

rianrajagede commented 4 years ago

I used a smaller learning rate, but I think the model didn't train on my custom dataset. The first loss is around 2 (previously, when I used MobileNetV3, the first loss was higher, around 30), and after 5000 steps the loss is still around 2.

Learning rate setup:

cosine_decay_learning_rate {
      learning_rate_base: .001
      total_steps: 50000
      warmup_learning_rate: .00026666
      warmup_steps: 5000
  }

hattafudholi commented 4 years ago

It also happened to me when trying to train with this model. The starting loss is around 2 and stays around 2 seemingly forever, then it suddenly explodes and training stops after a few thousand steps. I still can't figure out why. I really hope to see updates from others on this as well. Cheers,