NaN loss during training in Object Detection API

zychen2016 commented 4 years ago

System information

What is the top-level directory of the model you are using: models
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
TensorFlow installed from (source or binary): pip install
TensorFlow version (use command below): 1.15
Bazel version (if compiling from source): No
CUDA/cuDNN version: (not given)
GPU model and memory: 1080Ti, 11G
Exact command to reproduce:

Describe the problem

I am training a model with ssd_mobilenet_v2_fpnlite_quantized_shared_box_predictor_256x256_depthmultiplier_75_coco14_sync.config.

I commented out the sync options because I am training on a single GPU.

# FPNLite with Mobilenet v2 0.75 depth multiplied feature extractor and focal
# loss.
# Trained on COCO14, initialized from Imagenet classification checkpoint

# Achieves 20.0 mAP on COCO14 minival dataset.
# This config is TPU compatible. Search for "PATH_TO_BE_CONFIGURED" to find the
# fields that should be configured.

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    #num_classes: 90
    num_classes:11
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 256
        width: 256
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 128
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        share_prediction_tower: true
        use_depthwise: true
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_fpn'
      use_depthwise: true
      fpn {
        min_level: 3
        max_level: 7
        additional_layer_depth: 128
      }
      min_depth: 16
      depth_multiplier: 0.75
      #depth_multiplier: 1
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          random_normal_initializer {
            stddev: 0.01
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint: "/data2/CZY/data/ssd/models-master/research/object_detection/ssd_fpn/mobilenet_v2_1.0_224/mobilenet_v2_1.0_224.ckpt"
  batch_size: 24
  #sync_replicas: true
  #startup_delay_steps: 0
  #replicas_to_aggregate: 32
  num_steps: 100000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.4
          total_steps: 100000
          warmup_learning_rate: .026666
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data2/CZY/data/ssd/models-master/research/object_detection/ssd_fpn/train.record"
  }
  label_map_path: "/data2/CZY/data/ssd/models-master/research/object_detection/ssd_fpn/grape.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data2/CZY/data/ssd/models-master/research/object_detection/ssd_fpn/val.record"
  }
  label_map_path: "/data2/CZY/data/ssd/models-master/research/object_detection/ssd_fpn/grape.pbtxt"
  shuffle: false
  num_readers: 1
}

graph_rewriter {
  quantization {
    delay: 30000
    activation_bits: 8
    weight_bits: 8
  }
}

For transfer learning, I downloaded mobilenet_v2_1.0_224 from https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet

During training, the following error is reported.

errors

INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "model_main.py", line 111, in <module>
    tf.app.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 107, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
    run_metadata=run_metadata))
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError

zychen2016 commented 4 years ago

Could anyone help me?

zychen2016 commented 4 years ago

Could you give some advice on how to train ssd_mobilenetv2_fpn from an ImageNet classification checkpoint? Thanks.

satoshiSchubert commented 4 years ago

To summarize: "Please train on CPU for a few steps, the error message is much better." (from https://github.com/tensorflow/tensor2tensor/issues/574#issuecomment-364722156)

Hello @zychen2016, I also encountered this error when training a quantized model (I noticed you added the 'graph_rewriter' lines at the end of the config file). For me, training without the quantization lines at the end of the config file works well, but when I try quantization-aware training with a config like yours, the error occurs. What solved it for me was uninstalling tensorflow-gpu and installing tensorflow (the CPU version) instead.
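If you only want to try a few steps on CPU, hiding the GPU from the training process may be enough, without uninstalling tensorflow-gpu. Below is a minimal sketch, assuming the usual CUDA behaviour where CUDA_VISIBLE_DEVICES=-1 exposes no devices. The helper name cpu_only_env and the paths in the commented usage are hypothetical placeholders, not something from this thread.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden.

    Hypothetical helper for illustration: the caller's environment is
    copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    # "-1" matches no device index, so no GPU is visible to the child
    # process and TensorFlow places all ops on the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


# Hypothetical usage (placeholder paths, adjust to your own setup):
# import subprocess
# subprocess.run(
#     ["python", "model_main.py",
#      "--pipeline_config_path=path/to/pipeline.config",
#      "--model_dir=path/to/model_dir"],
#     env=cpu_only_env(),
#     check=True,
# )
```

The variable has to be set before TensorFlow initialises CUDA, which is why the sketch passes it to a fresh process instead of changing it after import.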

menahishayan commented 3 years ago

Reduce one or more of the following values:

Learning rate
Warmup learning rate
Momentum Optimizer Value

You'll have to experiment with different values and different combinations of the above to get it just right.
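To see what the posted schedule is doing when the loss diverges, it can help to compute the learning rate by hand. The sketch below is a plain-Python approximation of a cosine decay with linear warmup (linear ramp from warmup_learning_rate to learning_rate_base, then half-cosine decay to zero). It is my reading of the schedule, not the library's code, and the function name is made up. With the values from the config above, the rate at the last logged step (800) is already about 0.33. The "reduced" values are an arbitrary 10x cut for comparison only, not a tuned recommendation.

```python
import math


def cosine_lr_with_warmup(step, base_lr, total_steps, warmup_lr, warmup_steps):
    """Approximate learning rate at `step` (illustrative, not the library code)."""
    if step < warmup_steps:
        # Linear ramp from warmup_lr up to base_lr over the warmup period.
        slope = (base_lr - warmup_lr) / warmup_steps
        return warmup_lr + slope * step
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Values taken from the config posted in this issue.
posted = dict(base_lr=0.4, total_steps=100000,
              warmup_lr=0.026666, warmup_steps=1000)
# Arbitrary 10x reduction, shown only for comparison (assumption).
reduced = dict(base_lr=0.04, total_steps=100000,
               warmup_lr=0.0026666, warmup_steps=1000)

for step in (0, 800, 1000, 50000):
    print(step,
          round(cosine_lr_with_warmup(step, **posted), 5),
          round(cosine_lr_with_warmup(step, **reduced), 5))
```

Comparing the two columns shows how much smaller each update is early in training after the reduction. Whether that is enough to avoid the NaN still has to be checked by training.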

tensorflow / models

Nan loss during training in object detection api #7904
