tensorflow / models

Models and examples built with TensorFlow

loss explodes after few iterations #3868

Closed dotannn closed 6 years ago

dotannn commented 6 years ago

System information

What is the top-level directory of the model you are using: object_detection

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): use train.py script on my own dataset

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04

TensorFlow installed from (source or binary): installed from pip

TensorFlow version (use command below): 1.6

CUDA/cuDNN version: cuda 9.0

GPU model and memory: AWS - p3.2xlarge instance - using V100

I'm trying to train faster-rcnn resnet101 on my own dataset (something I did successfully a few months ago).
After a few steps the loss gets insanely big. Do you have any idea why this could happen?

RUNNING ON AWS - p3.2xlarge instance

INFO:tensorflow:global step 435: loss = 0.0630 (0.408 sec/step)
INFO:tensorflow:global step 436: loss = 0.1628 (0.404 sec/step)
INFO:tensorflow:global step 437: loss = 0.0577 (0.397 sec/step)
INFO:tensorflow:global step 438: loss = 54059178029467303936.0000 (0.403 sec/step)
INFO:tensorflow:global step 439: loss = 432537635714800549888.0000 (0.400 sec/step)
INFO:tensorflow:global step 440: loss = 2106942657570782838784.0000 (0.405 sec/step)
INFO:tensorflow:global step 441: loss = 7774169971762292326400.0000 (0.401 sec/step)
INFO:tensorflow:global step 442: loss = 25262924095336287830016.0000 (0.404 sec/step)

yhliang2018 commented 6 years ago

Can you please:

That will allow us to better understand the problem.

sk-g commented 6 years ago

It is really hard to say why something is happening without some context, but my guess is that you are using a high learning rate that is making your model wiggle around the optimum.

I would suggest using a learning rate schedule that decays exponentially with the global step. This documentation might give some insight into how to use the learning rate scheduler. This helped when I faced similar issues; hope it helps you too. Exponential learning rate decay is easy to code, but should you run into trouble, here is a stackoverflow link that shows an example of decayed learning rate usage.
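Something along these lines, for example (a rough TF 1.x sketch with illustrative decay numbers, not tuned for your setup):

import tensorflow as tf  # TF 1.x API

global_step = tf.train.get_or_create_global_step()

# Start at 3e-4 and multiply by 0.95 every 10k steps (illustrative values only).
learning_rate = tf.train.exponential_decay(
    learning_rate=3e-4,
    global_step=global_step,
    decay_steps=10000,
    decay_rate=0.95,
    staircase=True)

optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)

In the Object Detection API the same idea is normally expressed in the pipeline config's optimizer block rather than in code.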

dotannn commented 6 years ago

I'm fine-tuning the faster-rcnn-resnet101-coco model with my own dataset using 7 labels.

My initial learning rate is 0.0003. I don't see the standard 'jump' around the optimum that is usually seen when the LR is too high; my loss decreases as expected for hundreds of steps and then out of nowhere explodes to these very big numbers.

added general info

sk-g commented 6 years ago

Hi @dotannn, since, as you say, the loss decreased for some steps and then exploded, it could be due to many reasons; my first bet was the one I mentioned before. You could use a decayed learning rate so your model converges faster, but if you do not want to do that, I think the next obvious candidate is vanishing gradients. In case you are not familiar with that concept: some gradients may be so small that they are clipped to 0, which is a huge problem if many weights are suddenly clipped to 0 at once; you WILL 'jump' across the optimal point, and this gets chained to the rest of the weights as well. If you are already familiar with the problem, apologies.

But I would still emphasise the 'jumping' part. For example, look at this from your output:

INFO:tensorflow:global step 435: loss = 0.0630 (0.408 sec/step)
INFO:tensorflow:global step 436: loss = 0.1628 (0.404 sec/step)
INFO:tensorflow:global step 436: loss = 0.1628 (0.404 sec/step)
INFO:tensorflow:global step 437: loss = 0.0577 (0.397 sec/step)

We can clearly see the phenomenon in this part.

dotannn commented 6 years ago

Thanks, @sk-g, Actually that was a memory issue...
closing it now

RoloffMatek commented 6 years ago

Hello, I'm happy that I found this thread, since I'm facing the exact same issue, and this thread is the first on this problem. It would be great to hear what your solution was, @dotannn.

I tried lower learning rates but it didn't have any impact. Even with very low learning rates the problem occurred, always after about 500 steps. Then after about 30k to 50k steps the problem disappeared. Can anybody help me please? I would be so glad!

felipecicero commented 6 years ago

I'm having the same problem, but most commonly the gradient gets static and, after some iterations, suddenly explodes. With higher learning rates the gradient grows exponentially.

I have tested numerous combinations of crop, learning rate, batch size, atrous rates, and nothing changes this pattern.

ghost commented 6 years ago

Same problem over here for Faster R-CNN using ResNet101 and ResNet50. For me the loss decreases as expected, but after 20/30/40k steps it explodes. After that it comes back to the original level (below 1 for the RPN, below 5 for the 2nd stage) but doesn't produce any meaningful boxes anymore. I have already varied:

I have lowered the score threshold so I can see all boxes the net predicts, and it turns out that the model becomes invariant to the input, meaning it always predicts the same pattern of boxes.

I am really stuck. Did anybody resolve that issue? @sk-g @felipecicero @RoloffMatek @dotannn

felipecicero commented 6 years ago

I have managed to improve convergence a bit by setting last_layer_gradient_multiplier to 0.1 instead of 10, as I have seen suggested in some other topics here. I am also using slow_start_step for a few hundred iterations with a slow-start learning rate of 0.007 and then the momentum optimizer with a base learning rate of 0.0001. I removed the random scale augmentation as well.

But at some point before reaching the global minimum the loss always ends up exploding, and it showed no signs of improvement in any tests I did.

I am having several other problems related to training with my own dataset in DeepLab; I have not yet been able to report everything here due to my deadlines.

I am using 8 P100 GPUs.

python deeplab/train.py \
  --logtostderr \
  --training_number_of_steps=250000 \
  --train_split="train" \
  --model_variant="xception_71" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size=513 \
  --train_crop_size=513 \
  --train_batch_size=24 \
  --num_clones=4 \
  --base_learning_rate=0.0001 \
  --last_layer_gradient_multiplier=0.1 \
  --slow_start_step=500 \
  --slow_start_learning_rate=0.007 \
  --fine_tune_batch_norm=True \
  ...

ghost commented 6 years ago

Thanks for your quick reply!

I was also able to shift the loss explosion through hyperparameter tuning, but actually I found an error which seems to cause the problem in my case.

I was using a corrupt label file, which caused some incompatibility while learning, so creating new tfrecords and a completely re-parametrized setup fixed it. I know it seems obvious, but it took me quite a while to actually notice that there was something wrong with the labels. Loss explosion due to corrupt labels was also reported in some threads on stackoverflow (sorry, I don't have the links anymore). Hope it helps!

valleu326 commented 6 years ago

I've got this issue too, but I found the reason: my label_map.pbtxt contains 5 labels, but my pipeline.config has model { faster_rcnn { num_classes: 4 } }, which tells the machine there are only 4 classes. If the network predicts the 5th one, the probabilities of the other 4 will all be 0, which generates a very large loss value (log(0) is big).
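A quick way to catch this kind of mismatch is to count the item blocks in the label map and compare against num_classes in the config. A rough standalone sketch (plain-text parsing only; the file names are placeholders for your own):

import re

label_map_path = "label_map.pbtxt"       # placeholder, change to your path
pipeline_config_path = "pipeline.config"  # placeholder, change to your path

# Count the number of "item {" blocks in the label map.
with open(label_map_path) as f:
    num_labels = len(re.findall(r"\bitem\s*\{", f.read()))

# Grab the num_classes value from the pipeline config.
with open(pipeline_config_path) as f:
    match = re.search(r"num_classes\s*:\s*(\d+)", f.read())
num_classes = int(match.group(1)) if match else None

print("labels in label map:", num_labels)
print("num_classes in config:", num_classes)
if num_classes != num_labels:
    print("Mismatch! This alone can make the loss blow up.")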

lxq2t commented 6 years ago

The issue may be caused by:
- damaged bounding boxes (out of the image boundary, or xMin > xMax)
- mismatched category names between the tfrecord and map.pbtxt

I had the same issue with the loss exploding after a few iterations; it was fixed by checking the names in map.pbtxt.
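Both causes are easy to check in code. A rough TF 1.x sketch, assuming the record was written with the standard Object Detection API feature keys (image/object/bbox/* normalized to [0, 1] and image/object/class/text); adjust the path and keys if your generator differs:

import tensorflow as tf

record_path = "train.record"  # placeholder path
bad_boxes = 0
names = set()

for raw in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example()
    example.ParseFromString(raw)
    feat = example.features.feature
    xmins = feat["image/object/bbox/xmin"].float_list.value
    xmaxs = feat["image/object/bbox/xmax"].float_list.value
    ymins = feat["image/object/bbox/ymin"].float_list.value
    ymaxs = feat["image/object/bbox/ymax"].float_list.value
    names.update(n.decode("utf8") for n in feat["image/object/class/text"].bytes_list.value)
    for xmin, xmax, ymin, ymax in zip(xmins, xmaxs, ymins, ymaxs):
        if not (0.0 <= xmin < xmax <= 1.0 and 0.0 <= ymin < ymax <= 1.0):
            bad_boxes += 1

print("boxes with suspicious coordinates:", bad_boxes)
print("class names found in the record:", sorted(names))  # compare these against map.pbtxt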

felipecicero commented 6 years ago

Where do I find this map.pbtxt file? The training tutorial doesn't explain this file.

eypros commented 6 years ago

@felipecicero map.pbtxt is just a text file which contains the labels of each object class (the class with id xx has name yyy, etc.). Since you have trained your model, you must have used one in your config file, in the train_input_reader section: label_map_path: "path to your map.pbtxt".

cannguyen275 commented 5 years ago

The issue may be caused by:
- damaged bounding boxes (out of the image boundary, or xMin > xMax)
- mismatched category names between the tfrecord and map.pbtxt

I had the same issue with the loss exploding after a few iterations; it was fixed by checking the names in map.pbtxt.

@highemhigh Thanks so much! After changing the category names in map.pbtxt to match the tfrecord, it works.

yanfengliu commented 5 years ago

@highemhigh wow I never thought about the labelmap file. This solved my problem. Thank you so much!!

Siggi1988 commented 5 years ago

Hello, my name is Siggi.

How can I open the tfrecord files?

53RT commented 5 years ago

@Siggi1988

You need to decode the tfrecord files if you want to have a look inside. This should help you do that.
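For example, a minimal TF 1.x sketch that prints the first record so you can see which feature keys and class names it stores (the path is a placeholder):

import tensorflow as tf

record_path = "train.record"  # placeholder, point this at your own file

for raw in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example()
    example.ParseFromString(raw)
    print(example)  # dumps every feature key and value of the first record
    break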

imayachita commented 5 years ago

Thanks, @sk-g, Actually that was a memory issue... closing it now

@dotannn may I know how you resolved this memory issue? I think I have the same problem. Thanks!

loi-nguyen-khanh commented 4 years ago

The issue may be caused by:
- damaged bounding boxes (out of the image boundary, or xMin > xMax)
- mismatched category names between the tfrecord and map.pbtxt

I had the same issue with the loss exploding after a few iterations; it was fixed by checking the names in map.pbtxt.

Many thanks for your reply. Without it, it would have taken me a long time to solve my problem.

angyee commented 4 years ago

Thanks, @sk-g, Actually that was a memory issue... closing it now

I don't understand, could you explain?

PLICHET commented 4 years ago

I had the same problem because of a difference between the number of classes in pipeline.config ("faster_rcnn_inception_v2_pets.config") and the number of classes in the data. Check that out.

Boltuzamaki commented 4 years ago

I am using faster_rcnn_resnet50_coco and facing the same problem.

1) I relabelled my dataset from scratch to avoid any annotation errors.
2) I rechecked labelmap.txt many times.
3) I rechecked the classes and also tried different learning rates and gradient clipping.
4) My tfrecord is good too.
5) I rechecked the CSV files that were created.

I am training on only one class but the gradient explodes exponentially after a few iterations. Please help!

Following is my config file

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 400
        max_dimension: 600
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet50'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SIGMOID
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0001
          schedule {
            step: 900000
            learning_rate: .000001
          }
          schedule {
            step: 1200000
            learning_rate: .000001
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 5.0
  fine_tune_checkpoint: "/content/drive/My Drive/Tensorflow/models/faster_rcnn_resnet50_coco_2018_01_28/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/content/drive/My Drive/Tensorflow/models/train.record"
  }
  label_map_path: "/content/drive/My Drive/Tensorflow/models/training/labelmap.pbtxt"
}

eval_config: {
  num_examples: 36
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/content/drive/My Drive/Tensorflow/models/test.record"
  }
  label_map_path: "/content/drive/My Drive/Tensorflow/models/training/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}

rggs commented 4 years ago

I am training on only one class but the gradient explodes exponentially after a few iterations.

I'm having the same issue with training one class. I wonder if this is an issue unique to using one class?

DynamicCodes commented 4 years ago

I'm having the same problem training Mask R-CNN: the loss explodes after 500 steps. I have used only 1 class and my dataset is also very small, and the training is on Colab, so there should not be any memory issue. I can't figure out the reason; any solution will be appreciated!

erolgerceker commented 4 years ago

I'm having the same problem training Mask R-CNN: the loss explodes after 500 steps. I have used only 1 class and my dataset is also very small, and the training is on Colab, so there should not be any memory issue. I can't figure out the reason; any solution will be appreciated!

It's because of the generate_tfrecord.py file. Arrange it like this:

for index, row in group.object.iterrows():
    xmins.append(row['xmin'] / width)
    xmaxs.append(row['xmax'] / width)
    ymins.append(row['ymin'] / height)
    ymaxs.append(row['ymax'] / height)
    classes_text.append((str(row['class'])).encode('utf8'))  # if your classes are not strings, use this str() version
    # classes_text.append(row['class'].encode('utf8'))       # original line, works only when the class column is already a string
    classes.append(class_text_to_int(row['class']))

saqibshakeel035 commented 4 years ago

Does faster_rcnn_inception_v2 require the bounding box information as well? I am using only the mask information (polygons) to generate the TFRecords. Am I missing something?

I'm having the same problem training Mask R-CNN: the loss explodes after 500 steps. I have used only 1 class and my dataset is also very small, and the training is on Colab, so there should not be any memory issue. I can't figure out the reason; any solution will be appreciated!

It's because of the generate_tfrecord.py file. Arrange it like this:

for index, row in group.object.iterrows():
    xmins.append(row['xmin'] / width)
    xmaxs.append(row['xmax'] / width)
    ymins.append(row['ymin'] / height)
    ymaxs.append(row['ymax'] / height)
    classes_text.append((str(row['class'])).encode('utf8'))  # if your classes are not strings, use this str() version
    # classes_text.append(row['class'].encode('utf8'))       # original line, works only when the class column is already a string
    classes.append(class_text_to_int(row['class']))

joelbudu commented 3 years ago

@DynamicCodes @rggs, or anyone else, have you been able to find a solution? I am also facing the same challenge training a mask r-cnn model using one class on a small dataset. Seems to be a peculiar problem

rggs commented 3 years ago

@DynamicCodes @rggs, or anyone else, have you been able to find a solution? I am also facing the same challenge training a mask r-cnn model using one class on a small dataset. Seems to be a peculiar problem

In my case it turned out to be a simple case of misspellings between the labeled objects and the classes in the tfrecord files. It may be worth reading through the csvs that are output to make sure the labelled objects match the class(es) you tell it to look for.
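A quick way to do that check, assuming your pipeline produced CSVs with a 'class' column (as the common xml_to_csv script does); the paths below are placeholders:

import re
import pandas as pd

csv_path = "train_labels.csv"       # placeholder path
label_map_path = "labelmap.pbtxt"   # placeholder path

csv_classes = set(pd.read_csv(csv_path)["class"].astype(str))

# Collect the name fields from the label map, stripping surrounding quotes.
map_classes = set()
with open(label_map_path) as f:
    for line in f:
        m = re.search(r"name\s*:\s*(.+)", line)
        if m:
            map_classes.add(m.group(1).strip().strip("'\""))

print("only in the CSV:", csv_classes - map_classes)   # misspellings show up here
print("only in the label map:", map_classes - csv_classes)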

joelbudu commented 3 years ago

@rggs Thanks for this.

In case anyone else used the tutorial from https://github.com/vijendra1125/Custom-Mask-RCNN-Using-Tensorfow-Object-Detection-API, there was an issue with how the label.pbtxt is used for training.

Refer to this comment: https://github.com/vijendra1125/Custom-Mask-RCNN-Using-Tensorfow-Object-Detection-API/issues/25#issuecomment-699052106

parthlathiya2697 commented 3 years ago

@rggs Thanks for this.

In case anyone else used the tutorial from https://github.com/vijendra1125/Custom-Mask-RCNN-Using-Tensorfow-Object-Detection-API, there was an issue with how the label.pbtxt is used for training.

Refer to this comment: vijendra1125/Custom-Mask-RCNN-Using-Tensorfow-Object-Detection-API#25 (comment)

Did you get any solution to this problem?

parthlathiya2697 commented 3 years ago

Still got the same loss explosion after 25000 steps. I am training MobileNet v1 from the TensorFlow model zoo to build an object detection model that detects only balls 🎾. Using MobileNet's configuration pipeline, I've edited num_classes to 1 and set label_map_path=ball.pbtxt, which has only one item, i.e. 'ball' itself (cross-checked with the .record file too). I also tried reducing batch_size, in case that matters, but still get the same issue.

Edit: I annotated all images again, generated the XML files, and converted them to .record files. Now the loss explosion problem does not show up and training goes smoothly.

New issue: with the same .record files, when I started training MobileNet v2, the loss explosion occurs again. I checked the pipeline.config again; num_classes=1.

Check out this article; it gives quite a clear picture. But I can't figure out how to fix it.

Stack Overflow answers suggest:

To discuss the potential reasons for this explosion: it could probably be because of a nasty combination of random initialisation of the weights, the learning rate, and also probably the batch of training data that was passed during that iteration.

Without knowing the exact details of the model, you should try a smaller learning rate and probably shuffle your training data well. Hope this somewhat helps.

In the case of deep neural networks, this can occur due to exploding/vanishing gradients. You may want to either do weight clipping or adjust the weight initialization so that the weights are closer to 1, which reduces the chance of an explosion.

Also, if your learning rate is big, then such a problem can occur. In that case you can either lower the learning rate or use learning rate decay.

Could be an exploding gradient, i.e. one very big gradient step makes your model "jump" to some extremely far away point where it gets really bad loss, and then it has to "recover" from this slowly. This is a problem especially for RNNs.
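For the exploding-gradient case the usual remedy is gradient-norm clipping; in the Object Detection API this is the gradient_clipping_by_norm field in train_config (it appears in the Faster R-CNN config earlier in this thread). A toy, self-contained TF 1.x sketch of the same idea in raw code (the quadratic loss is only there so the snippet runs on its own):

import tensorflow as tf  # TF 1.x API

# Toy quadratic loss so the snippet runs standalone; in the Object Detection API
# the equivalent knob is train_config { gradient_clipping_by_norm: ... }.
w = tf.Variable([10.0, -10.0])
loss = tf.reduce_sum(tf.square(w))

optimizer = tf.train.MomentumOptimizer(learning_rate=0.0003, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=10.0)  # cap the global gradient norm
train_op = optimizer.apply_gradients(list(zip(clipped, variables)))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        _, loss_value = sess.run([train_op, loss])
        print(loss_value)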

Has anyone come to any other conclusions, or does anyone know how to make these changes when training MobileNet v2 from the object detection zoo?

A further update on my work: exploding/vanishing gradients were the only explanation left for me, so I looked that up, which led me to change settings in ssd_mobilenet_v2.config, lowering learning_rate_base to 0.008 from 0.800000011920929 and warmup_learning_rate to 0.0013333 from 0.13333000242710114.

Now my training is quite stable. It fluctuates by decimal points, but that's fine with me. The tradeoff with this method is that your training time increases, if you're up for that. Stop training when the loss reaches the desired value; refer to TensorBoard to visualise the loss drop.

Config with the loss explosion:

optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: 0.800000011920929
        total_steps: 90000
        warmup_learning_rate: 0.13333000242710114
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Edited config:

optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: .008
        total_steps: 90000
        warmup_learning_rate: 0.0013333
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Good luck 🤩 I've also posted an article covering this in depth; you can check it out here.

SaimAli420 commented 3 years ago

I was facing the same issue, but later on I found that this error occurs due to differences in the class names or the number of classes between file.record and file.pbtxt. So you need to pass the same names and number of classes in file.pbtxt as you pass to tf_record while generating the tfrecords. Example:

generate_tfrecord.py (while generating the tfrecord):

def class_text_to_int(row_label):
    if row_label == 'cat':
        return 1
    else:
        return 0

file.pbtxt:

item {
  id: 1
  name: 'cat'
}

bratyslav commented 3 years ago

In my case, standardizing the input pixels of the images to [-1, 1] helped.
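For example, something like this in the input pipeline (just a sketch, assuming uint8 RGB images):

import tensorflow as tf

def standardize(image):
    # Map uint8 pixel values in [0, 255] to floats in [-1, 1].
    return tf.cast(image, tf.float32) / 127.5 - 1.0

# e.g. image = standardize(decoded_jpeg) inside your input pipeline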

nguyenthekhoig7 commented 5 months ago

For those who are using a one-class dataset, I believe that num_classes in the config file should be set to 2 (adding 1 for the background, I guess); see this Tutorial-OD:

num_classes = 3 ... exp_config.task.model.num_classes = num_classes + 1