Apollo-XI commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ ] I am using the latest TensorFlow Model Garden release and TensorFlow 2. -> I'm using TF 1.15.2 as TF Object Detection isn't compatible with TF 2.
[x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

Model: http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_sync_2020_05_18.tar.gz

2. Describe the bug

When I load the ssd_mobilenet_v2_mnasfpn_coco checkpoint to train on the Oxford-IIIT Pets dataset, TF throws several errors, although training starts. However, the model doesn't learn anything, and sometimes the loss goes to NaN, as reported in #8549. The same errors appear when the training script loads the model to perform evaluation.
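
A minimal sketch for narrowing down load-time errors like these, assuming they concern variables that are missing from the checkpoint or stored with a different shape (the attached log would confirm or rule that out). `diff_variables` is a hypothetical helper, not part of the Object Detection API, and the variable names in the toy data are made up. It only compares two lists of (name, shape) pairs.

```python
def diff_variables(ckpt_vars, model_vars):
    """Compare (name, shape) pairs from a checkpoint and from the model graph.

    Hypothetical helper for illustration only. Returns three sorted lists:
    variables the model wants but the checkpoint lacks, variables only the
    checkpoint has, and variables present in both with different shapes.
    """
    ckpt = {name: tuple(shape) for name, shape in dict(ckpt_vars).items()}
    model = {name: tuple(shape) for name, shape in dict(model_vars).items()}
    missing = sorted(set(model) - set(ckpt))
    unused = sorted(set(ckpt) - set(model))
    mismatched = sorted(n for n in set(model) & set(ckpt) if model[n] != ckpt[n])
    return missing, unused, mismatched


# Toy data standing in for real variable lists (names and shapes are made up).
ckpt_vars = [("FeatureExtractor/conv/weights", [3, 3, 3, 32]),
             ("BoxPredictor/class/biases", [91])]
model_vars = [("FeatureExtractor/conv/weights", [3, 3, 3, 32]),
              ("BoxPredictor/class/biases", [38]),
              ("BoxPredictor/extra/weights", [1, 1, 64, 64])]

missing, unused, mismatched = diff_variables(ckpt_vars, model_vars)
print("missing from checkpoint:", missing)
print("unused in checkpoint:", unused)
print("shape mismatch:", mismatched)
```

With a real checkpoint, the first argument could come from `tf.train.list_variables(checkpoint_path)`, which returns (name, shape) pairs in TF 1.15; the second would have to be collected from the built training graph.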

3. Steps to reproduce

Fine-tune on the Oxford-IIIT Pets dataset, changing the COCO config to adapt it to the new dataset:

num_classes: 37, batch_size: 32, etc.

Code to reproduce the issue:

Open this gist (not mine): https://gist.github.com/NobuoTsukamoto/b2ca173b62e933ceeb1c7f0df42bca5f
Change model to ssdlite_mobilenetv2_mnas_fpn_model
Upload the config file appended at the end of this issue to /content/data/pipeline.config
Run all
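
The config changes described in step 3 can be sketched as a plain text substitution. This is only an illustration: `set_scalar_field` is a hypothetical helper, the sample config is a shortened stand-in with placeholder values, and the regex assumes simple `field: value` lines as in the config below.

```python
import re


def set_scalar_field(config_text, field, value):
    """Replace the value of every `field: <value>` occurrence in a text config.

    Hypothetical helper for illustration. Returns (new_text, replacement_count).
    """
    # Match the field name, the colon, and either a quoted string or a bare token.
    pattern = re.compile(r'(\b%s\s*:\s*)("[^"]*"|[^\s}]+)' % re.escape(field))
    return pattern.subn(lambda m: m.group(1) + str(value), config_text)


# Shortened stand-in for a pipeline config; the starting values are placeholders.
sample = """model {
  ssd {
    num_classes: 90
  }
}
train_config: {
  batch_size: 1024
}
"""

text, n_classes_edits = set_scalar_field(sample, "num_classes", 37)
text, n_batch_edits = set_scalar_field(text, "batch_size", 32)
print(text)
```

Checking the replacement counts is a cheap way to notice a misspelled field name such as `num_class`, which would match nothing.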

4. Expected behavior

Model loads without errors and learns something.

5. Additional context

Install TF Object Detection: https://github.com/tensorflow/models/tree/master/research/object_detection

Message logs: log.txt

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Colab machine
TensorFlow installed from (source or binary): preinstalled
TensorFlow version (use command below): TF 1.15.2

Pipeline config:

# SSD with MnasFPN feature extractor, shared box predictor
# See Chen et al, https://arxiv.org/abs/1912.01106
# Trained on COCO, initialized from scratch.
#
# 0.92B MulAdds, 2.5M Parameters. Latency is 193ms on Pixel 1.
# Achieves 26.6 mAP on COCO14 minival dataset.

# This config is TPU compatible

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 37
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 6
        anchor_scale: 3.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 3
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 64
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        share_prediction_tower: true
        use_depthwise: true
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_mnasfpn'
      fpn {
        min_level: 3
        max_level: 6
        additional_layer_depth: 48
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          random_normal_initializer {
            stddev: 0.01
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 32
  num_steps: 50000

  fine_tune_checkpoint: "/content/data/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_sync_2020_05_18/model.ckpt"
  fine_tune_checkpoint_type: "detection"

  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 4.
          total_steps: 50000
          warmup_learning_rate: .026666
          warmup_steps: 5000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/content/data/tf_record/pet_faces_train.record-?????-of-00010"
  }
  label_map_path: "/content/models/research/object_detection/data/pet_label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 1104
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/content/data/tf_record/pet_faces_val.record-?????-of-00010"
  }
  label_map_path: "/content/models/research/object_detection/data/pet_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

tombstone commented 4 years ago

@Apollo-XI Learning rate of 4.0 seems too high. I recommend trying to lower it by 1-2 orders of magnitude.
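
To see what lowering the base rate does over the run, here is a plain-Python approximation of a linear-warmup-then-cosine-decay schedule using the parameter names from the config above. It is a sketch of the general technique, not the Object Detection API's own implementation, and scaling the warmup start together with the base rate is my assumption.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base, total_steps,
                             warmup_learning_rate, warmup_steps):
    """Linear warmup to learning_rate_base, then cosine decay towards zero."""
    if step > total_steps:
        return 0.0
    if step < warmup_steps:
        # Straight line from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / float(total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


for base in (4.0, 0.4, 0.04):
    # Assumption: shrink the warmup start by the same factor as the base rate.
    warmup_start = 0.026666 * base / 4.0
    peak = cosine_decay_with_warmup(5000, base, 50000, warmup_start, 5000)
    halfway = cosine_decay_with_warmup(27500, base, 50000, warmup_start, 5000)
    print("base=%g peak=%g halfway=%g" % (base, peak, halfway))
```

The peak is reached right at the end of warmup (step 5000 here), so an overly high base rate would be expected to show up within the first few thousand steps.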

Apollo-XI commented 4 years ago

@tombstone I tried. I used 4.0 because the ssd_mv2_mnas_fpn config uses it to train on the COCO dataset. Since I saw gradients exploding before the error went to NaN, I lowered the LR to 0.4 and 0.04. With 0.4, gradients still exploded. With lr=0.04, I could train longer, but the net didn't learn anything :/

rianrajagede commented 4 years ago

I used a smaller learning rate, but I think the model didn't train on my custom dataset. The first loss is around 2 (previously, when I used MobileNetV3, the first loss was higher, around 30), and after 5000 steps the loss is still around 2.

Learning rate setup:

cosine_decay_learning_rate {
  learning_rate_base: .001
  total_steps: 50000
  warmup_learning_rate: .00026666
  warmup_steps: 5000
}

hattafudholi commented 4 years ago

It also happened to me when trying to train using this model. The starting loss is around 2 and stays around 2 seemingly forever, then it suddenly explodes and training stops after a few thousand steps. I still can't figure out why. I really hope to see updates from others on this as well. Cheers,

Errors while loading ssd_mobilenet_v2_mnasfpn_coco checkpoint #8581