tensorflow / models

Models and examples built with TensorFlow

Loss increases suddenly after training the model nicely for ~30 mins #5165

Closed harshilpatel312 closed 4 years ago

harshilpatel312 commented 6 years ago

System information

Describe the problem

I gathered labelled data for object detection using my own camera, trained the model, ran predictions, and everything worked fine. Then I decided to supplement the data with labelled data from the Open Images Dataset: I cleaned up the data, added zero padding to resize the images to 1920x1080, and trained on it. The loss decreased steadily, as expected, for ~30 minutes, after which it suddenly increased and the model never converged after that (see the attached TotalLoss plots).

Could someone tell me what's wrong? I'm not sure if it is a bug or if I'm doing anything wrong.
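
For context, the zero-padding step I mention might look roughly like this (a minimal sketch, not my exact script; it assumes OpenCV, padding on the bottom/right only, and boxes stored as absolute pixel coordinates):

import cv2

def pad_to_canvas(image, boxes, target_w=1920, target_h=1080):
    """Zero-pad an image to target_w x target_h, padding bottom/right only."""
    h, w = image.shape[:2]
    padded = cv2.copyMakeBorder(image, 0, target_h - h, 0, target_w - w,
                                cv2.BORDER_CONSTANT, value=0)
    # Boxes given as (xmin, ymin, xmax, ymax) in pixels are unchanged by
    # bottom/right padding, but the normalized coordinates written to the
    # tfrecord must be recomputed against the new 1920x1080 canvas.
    normalized = [(xmin / target_w, ymin / target_h, xmax / target_w, ymax / target_h)
                  for (xmin, ymin, xmax, ymax) in boxes]
    return padded, normalized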

[TotalLoss plots attached: zoomed-in and zoomed-out views]

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update it if it is relevant in your case, or leave it as N/A? Thanks.

Have I written custom code

mawanda-jun commented 6 years ago

Hi, could you please share your .config file, if it's different from the standard one?

harshilpatel312 commented 6 years ago

@mawanda-jun In addition to changing 'num_classes' and adding the .ckpt file, I made the following changes to the .config file:

train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
    input_path: "data/train_oid.record"
  }
  label_map_path: "data/turk_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/test.record"
    input_path: "data/test_oid.record"
  }
  label_map_path: "data/turk_map.pbtxt"
  shuffle: false
  num_readers: 1
}
mawanda-jun commented 6 years ago

Sorry, it's not clear to me: why do you have two unrelated input_paths?

harshilpatel312 commented 6 years ago

@mawanda-jun As far as I know, you cannot append new or additional data to an existing tfrecord unless you regenerate it from scratch. So if you want to train on the new data as well as your old dataset, one way to do it is to list multiple input paths, each pointing to the tfrecord of a different dataset.

Here, the first input path is the tfrecord for the dataset I collected with my camera, and the second is the tfrecord for the Open Images Dataset.
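
For what it's worth, if you did want a single combined record instead of two input paths, a rough TF 1.x-style sketch for merging existing tfrecords would be something like the following (illustrative only; the output filename is arbitrary):

import tensorflow as tf

def merge_tfrecords(input_paths, output_path):
    # Copy the serialized examples from several tfrecords into one file.
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for path in input_paths:
            for serialized_example in tf.python_io.tf_record_iterator(path):
                writer.write(serialized_example)

merge_tfrecords(["data/train.record", "data/train_oid.record"],
                "data/train_combined.record")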

mawanda-jun commented 6 years ago

Ok, there is always something new to learn. :) However, could you please try keeping only one record at a time and see whether one of the two records is not performing well?

P.S. Is the name test_o"i"d.record right?

harshilpatel312 commented 6 years ago

Yeah, generating the combined dataset from scratch was next on my list. I will report back with results soon.

I'm sorry, I didn't get your question about the name. If you thought I misspelled "old" as "oid", then no: "oid" stands for Open Images Dataset.

mawanda-jun commented 6 years ago

Hi, did you succeed in solving the problem?

harshilpatel312 commented 6 years ago

Sorry for the late reply; I was stuck with something else at work.

I did try combining the datasets and then training, but I'm facing the same problem: it trains well for some time, then the loss suddenly starts increasing.

jillelajitta commented 6 years ago

Hi, sorry for hijacking this thread. I'm getting the following warnings when I try to retrain my model (SSD + GoogleNet). Could someone please help?

Thanks.

WARNING:tensorflow:Ignoring ground truth with image id 2132974076 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 2132974076 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 483474429 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 483474429 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1042541771 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1042541771 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 374107777 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 374107777 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 851704672 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 851704672 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1267094741 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1267094741 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 674017641 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 674017641 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 2011324514 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 2011324514 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 655823972 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 655823972 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1069886348 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1069886348 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 432647899 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 432647899 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 192947873 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 192947873 since it was previously added

harshilpatel312 commented 6 years ago

@jillelajitta I see you've posted this same issue multiple times. Be patient; someone will reply. I would recommend Googling your issue. The first link seems helpful.

mawanda-jun commented 6 years ago

Then I really don't know how to help. My last attempt: could you please share your whole config file, so I can take a look and understand the problem better? I think the problem could be related to the optimizer configuration, but I'm not sure...

harshilpatel312 commented 6 years ago

Here you go!

faster_rcnn_inception_v2_coco.config

dkloving commented 6 years ago

@harshilpatel312 I'm having the same issue as @jillelajitta and this is the first link that Google suggests. Can you provide the one that you are seeing?

mawanda-jun commented 6 years ago

@dkloving it's funny, I answered @jillelajitta in this post. I think it didn't come up as the first result because the thread is not marked as solved. Let me know there whether you manage to solve the problem.

mawanda-jun commented 6 years ago

@harshilpatel312 the mystery goes deeper. Well, the only difference I can spot compared to mine is here:

eval_config: {
  num_examples: 480
  eval_interval_secs: 150
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  # max_evals: 10
}

See if this configuration helps. Your config tells TensorFlow to stop after 10 evaluations; mine tells it to keep evaluating indefinitely, but only every 150 seconds. I can't tell whether there is any correlation, but maybe it's a bug and you can work around it by not telling TensorFlow to stop evaluating...

I attach my working config file so you can see the other differences!

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        width: 400
        height: 400
      }
      # keep_aspect_ratio_resizer {
      #   min_dimension: 400
      #   max_dimension: 800
      # }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.00001
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 4
  optimizer {
    # momentum_optimizer {
    adam_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0001
          decay_steps: 600
          decay_factor: 0.95
        }
      }
      # momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/path/to/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the COCO dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/path/to/train.record"
  }
  label_map_path: "/path/to/object-detection.pbtxt"
}

eval_config: {
  num_examples: 480
  eval_interval_secs: 150
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  # max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/path/to/test.record"
  }
  label_map_path: "/path/to/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}
harshilpatel312 commented 6 years ago

I tried changing the eval_config too; it doesn't seem to help.

dkloving commented 6 years ago

@mawanda-jun Thanks, that solved my issue!

Victorsoukhov commented 6 years ago

I have the same problem. It was solved by applying the code fix described in https://github.com/tensorflow/models/issues/4856 and setting num_examples: 1 in the eval_config section of the pipeline config file. As pointed out in the documentation (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md) and in the proto files (https://github.com/tensorflow/models/blob/master/research/object_detection/protos/eval.proto), num_examples is the batch size of the evaluation.
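
If you would rather change it programmatically than edit the file by hand, a sketch along these lines should work (it assumes the Object Detection API's config_util helpers; the paths are placeholders):

from object_detection.utils import config_util

# Load the pipeline config, set the evaluation batch size, and write it back out.
configs = config_util.get_configs_from_pipeline_file("path/to/pipeline.config")
configs["eval_config"].num_examples = 1
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "path/to/output_dir")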

harshilpatel312 commented 6 years ago

I updated the code in model_lib.py and changed num_examples to 1. It does not work.

super-penguin commented 5 years ago

@harshilpatel312 Hi, did you succeed in solving the problem? I am having similar issues and can't figure out the solution.

harshilpatel312 commented 5 years ago

@super-penguin Nope, I moved on to using something else.

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.