tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

NaN error while training on COCO/Pascal VOC/custom dataset after a random number of steps #215

Closed hkuyam008 closed 6 years ago

hkuyam008 commented 6 years ago

I tried training Luminoth on the standard COCO and Pascal VOC datasets, as well as on my own dataset converted to COCO format (in every case the data was transformed to tfrecords using the transform tool). After running for one to two hours, training terminated each time with the same error; the error trace is attached.

I am issuing a simple lumi train -c config.xml command. I have also attached a zip of the dataset directory and the config file for reference.

train error trace.txt tf.zip config.zip
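For reference, the conversion step looked roughly like the sketch below (the paths are placeholders and the exact flags may differ depending on the Luminoth version); training was then launched with the lumi train command mentioned above.

```
# Rough sketch of how the datasets were converted to tfrecords with
# Luminoth's transform tool (placeholder paths; --type was pascal or coco
# depending on the dataset).
lumi dataset transform \
    --type pascal \
    --data-dir datasets/pascal/VOCdevkit/VOC2007/ \
    --output-dir datasets/pascal/tf/ \
    --split train --split val
```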

dekked commented 6 years ago

Looks like the loss is exploding. Try to lower your learning rate.
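For example, assuming your config overrides the piecewise_constant schedule from the default base_config.yml, something along these lines should do it (a sketch only; double-check the key names and default values for your Luminoth version):

```yaml
train:
  learning_rate:
    decay_method: piecewise_constant
    # Step boundaries at which the learning rate changes, and the value used
    # in each interval; starting roughly an order of magnitude lower than the
    # defaults can help keep the loss from blowing up.
    boundaries: [250000, 450000, 600000]
    values: [0.0001, 0.00003, 0.00001, 0.000003]
```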

What kind of images do you have? Your tfrecords file is only about 77 KB. Are your images tiny? How many are there? Can you post examples?

hkuyam008 commented 6 years ago

You are right, the error trace suggests that the loss is exploding:

INFO:tensorflow:step: 1622, file: b'TrainImg01.png', train_loss: 676.6780395507812, in 2.88s
INFO:tensorflow:step: 1623, file: b'TrainImg02.png', train_loss: 388.91082763671875, in 2.97s

I am currently moving to faster GPU-based hardware to reduce training time, and then I will probably try your suggestion of lowering the learning rate.

My images are simply a bunch of different web page screenshots; I have attached one for reference. They are mostly white background with a handful of UI controls, which I am guessing is why the tfrecord file is so small.

On a side note, I was able to complete a couple of training runs yesterday without hitting the error, so I guess the issue is intermittent at best.

trainimg02

hkuyam008 commented 6 years ago

This error did not materialize after I moved to faster GPU-based hardware, so I am closing this issue.