thtrieu / darkflow

Translate darknet to tensorflow. Load trained weights, retrain/fine-tune using tensorflow, export constant graph def to mobile devices
GNU General Public License v3.0

After 44145 steps and getting "loss nan - moving ave loss nan" #906

Open Sonymon opened 6 years ago

Sonymon commented 6 years ago

(screenshot: training log showing "loss nan - moving ave loss nan")

I trained with the following parameters (full cfg pasted below).

I have gone through several related issues but have not been able to solve this. I also checked whether any annotation has xmin > xmax or ymin > ymax, and everything is fine. My dataset is 200 images per class for 8 classes.
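A minimal sketch of that box check, assuming Pascal-VOC-style XML files in train/annotations (the tag names below are the standard VOC/labelImg ones; adjust them to your own annotation format):

# Sketch: flag annotations with degenerate boxes (xmin >= xmax or ymin >= ymax).
import glob
import xml.etree.ElementTree as ET

for path in glob.glob("train/annotations/*.xml"):
    root = ET.parse(path).getroot()
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        if xmin >= xmax or ymin >= ymax:
            print(path, obj.find("name").text, xmin, ymin, xmax, ymax)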

I used the following command to resume training from step 43250:

flow --model cfg/tiny-yolo-voc-1c.cfg --load 43250 --train --annotation train/annotations --dataset train/images --gpu 0.6 --epoch 4000

[net]
batch=32
subdivisions=8
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
max_batches = 40100
policy=steps
steps=-1,100,20000,30000
scales=.1,10,.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=65
activation=linear

[region]
anchors = 1.08,1.19,  3.42,4.41,  6.63,11.38,  9.42,5.11,  16.62,10.52
bias_match=1
classes=8
coords=4
num=5
softmax=1
jitter=.2
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .5
random=1
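As a sanity check on the cfg: for a YOLOv2-style region layer the last convolutional layer needs num * (coords + classes + 1) = 5 * (4 + 8 + 1) = 65 filters, which matches filters=65 above, so the detection head itself is sized correctly.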

System configuration: NVIDIA GeForce GTX 1060 6 GB, 16 GB RAM, 256 GB SSD, Intel Core i7 (7th gen).

sandeeprepakula commented 6 years ago

Can you try reducing the learning rate and resuming training from the latest checkpoint?
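For example, something along these lines should resume from the most recent checkpoint with a smaller learning rate (darkflow's defaults include a --lr flag; if your version does not have it, lower learning_rate in the cfg instead):

flow --model cfg/tiny-yolo-voc-1c.cfg --load -1 --train --annotation train/annotations --dataset train/images --gpu 0.6 --lr 1e-5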

Sonymon commented 6 years ago

I reduced the learning rate from 0.001 to 0.0001; now the nan issue starts at step 44681.


Sonymon commented 5 years ago

The problem is not solved yet. Kindly help me out.

monoloxo commented 5 years ago

Maybe you can try increasing the batch size or reducing the learning rate.
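If memory allows, the batch size can also be raised from the command line (darkflow's defaults include a --batch flag; check flow --h in your version), e.g.:

flow --model cfg/tiny-yolo-voc-1c.cfg --load -1 --train --annotation train/annotations --dataset train/images --gpu 0.6 --batch 64 --lr 1e-5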

bluesy7585 commented 5 years ago

I have this problem too. I'm training tiny-YOLO v2 with 1 or 3 classes (VOC2007 classes "car", "bus", "motorbike"), and it happens after a few epochs, so I think the data annotations are correct. Lowering the learning rate does not help, and I can't use a larger batch size because my GPU only has 4 GB of memory.

Does anyone have a solution? Thanks. EDIT: as suggested in a comment in #793, using --trainer adam avoids this problem; it works for me! :)
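Concretely, that workaround is just adding --trainer adam to the training command from above, e.g.:

flow --model cfg/tiny-yolo-voc-1c.cfg --load -1 --train --annotation train/annotations --dataset train/images --gpu 0.6 --trainer adam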

Zev-Yin commented 5 years ago

Maybe you can change the optimizer (e.g. adam) and reduce the learning rate.