hhk7734 / tensorflow-yolov4

YOLOv4 Implemented in Tensorflow 2.
MIT License

Nans near end of training run #61

Closed · wmcnally closed this issue 3 years ago

wmcnally commented 3 years ago

Thank you for your nice implementation of YOLOv4 in TF2. I am using the tiny version on a custom dataset and sometimes I encounter nans during training. Do you have any ideas as to why this is happening? Is it possible for there to be a division by zero in the loss function?

hhk7734 commented 3 years ago

Which yolov4 version are you using?

python3 -m pip show yolov4
hhk7734 commented 3 years ago

Can you share a sample of your dataset?

wmcnally commented 3 years ago

Version 2.0.3. I'm using a custom tf data pipeline, so I cannot easily share it. However, I'm not using any data augmentation, and the nans only appear after several epochs, which leads me to believe my data loader is not the problem (i.e., the model sees all the samples several times before the nans appear).

hhk7734 commented 3 years ago

Are the iou, obj, and cls losses all nan?

wmcnally commented 3 years ago

I turned loss verbose off, but here is the model.fit verbose output:

Epoch 15/20
188/188 [==============================] - 226s 1s/step - loss: 2.7143 - output_1_loss: 2.3952 - output_2_loss: 0.0162 - val_loss: 8.0879 - val_output_1_loss: 7.7695 - val_output_2_loss: 0.0173
Epoch 16/20
188/188 [==============================] - 222s 1s/step - loss: nan - output_1_loss: nan - output_2_loss: 0.0114 - val_loss: nan - val_output_1_loss: nan - val_output_2_loss: 0.0000e+00

I will turn loss verbose back on and see if I can get more information.

hhk7734 commented 3 years ago

I have modified a lot of the parts where nan could occur. I am now working on yolov4 v3.0.0 and reviewing the math again. Please leave a comment if you come up with a reason for the nan other than the math.

wmcnally commented 3 years ago

OK, that's great. Hopefully the issue is resolved in v3.0.0. Do you know when you will be releasing it?

hhk7734 commented 3 years ago

To improve training speed, I'm converting some of the dataset-related code to C++. There are a few other tasks left, but it will probably take about a week.

wmcnally commented 3 years ago

Looks like the issue is related to conf_loss:

step before nan:

grid: 30*30 iou_loss: 2.21239805 conf_loss: 0.192396224 prob_loss: 0.0061624 total_loss 2.41095662
grid: 15*15 iou_loss: 0 conf_loss: 0.0158854965 prob_loss: 0 total_loss 0.0158854965

nan step:

grid: 30*30 iou_loss: 2.03567219 conf_loss: nan prob_loss: 0.00672992691 total_loss nan
grid: 15*15 iou_loss: 0 conf_loss: 0.0159363803 prob_loss: 0 total_loss 0.0159363803

step after nan:

grid: 30*30 iou_loss: 34.6766624 conf_loss: nan prob_loss: 89.1754837 total_loss nan
grid: 15*15 iou_loss: 0 conf_loss: 0 prob_loss: -0 total_loss 0

Note that all my bounding boxes are small (12x12 px with an input size of 480x480), so I think that's why the iou loss for the 15*15 grid is always 0.
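
For reference, a rough sketch of why a 12x12 box would only register on the finer head. The anchor sizes and the shared-center IoU matching rule below are illustrative assumptions, not necessarily this repo's exact assignment logic:

```python
# Hypothetical anchor split, for illustration only (not necessarily this repo's anchors).
fine_anchors = [(10, 14), (23, 27), (37, 58)]        # 30x30 head (stride 16 at 480x480)
coarse_anchors = [(81, 82), (135, 169), (344, 319)]  # 15x15 head (stride 32)

def anchor_iou(box_wh, anchor_wh):
    """IoU of a box and an anchor, assuming they share the same center (width/height only)."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

box = (12, 12)  # small ground-truth box, in pixels
print([round(anchor_iou(box, a), 3) for a in fine_anchors])    # [0.732, 0.232, 0.067] -> matched here
print([round(anchor_iou(box, a), 3) for a in coarse_anchors])  # [0.022, 0.006, 0.001] -> no match
```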

wmcnally commented 3 years ago

Looking at the loss function now... if pred_conf - 1e-9 = 1 (line 140), you will run into backend.log(0)... Do you think that could be it?

hhk7734 commented 3 years ago

It could be. K.epsilon() and np.finfo(np.float32).eps are both about 1e-7, and v3.0 uses 1e-7. To be safer, it actually uses K.binary_crossentropy (K is the Keras backend).

Change 1e-9 to 1e-6, 1e-7, or backend.epsilon().
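
A minimal sketch of the precision issue, assuming the no-object term looks something like backend.log(1.0 + eps - pred_conf); the exact expression on line 140 is an assumption here, not the repo's actual code:

```python
import numpy as np
import tensorflow as tf

# 1e-9 is far below float32 machine epsilon (~1.19e-7), so it is absorbed when added to 1.0.
print(np.finfo(np.float32).eps)                      # 1.1920929e-07
print(float(tf.constant(1.0) + tf.constant(1e-9)))   # 1.0 -> the epsilon vanished

# With a saturated confidence (a sigmoid output rounded to exactly 1.0 in float32),
# the hypothetical no-object term hits log(0) = -inf; a later 0 * -inf can then give nan.
pred_conf = tf.constant(1.0)
print(float(tf.math.log(1.0 + tf.constant(1e-9) - pred_conf)))  # -inf

# An epsilon at or above float32 eps keeps the log argument strictly positive.
print(float(tf.math.log(1.0 + tf.constant(1e-7) - pred_conf)))  # about -15.9, finite
print(tf.keras.backend.epsilon())                                # 1e-07 by default
```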

hhk7734 commented 3 years ago

yolov4 v3.0.0 is released.

https://wiki.loliot.net/docs/lang/python/libraries/yolov4/python-yolov4-training

wmcnally commented 3 years ago

Thanks. Changing eps to 1e-7 in v2.0.3 worked for me.