Training customised data set. Getting loss or weight norm is nan. Training Stopped!

lakshmankanakala commented 5 years ago

I have converted my dataset to ADE-format annotation images.

I have only two classes, so the annotation images contain only the pixel values 1 and 2; all remaining pixels are 0.

I used this command:

python ./run.py --network 'resnet_v1_50' --visible_gpus '0,1' --reader_method 'queue' --lrn_rate 0.0001 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.001 --database 'ADE' --subsets_for_training 'train' --batch_size 2 --train_image_size 480 --snapshot 30000 --train_max_iter 90000 --test_image_size 480 --random_rotate 0 --fine_tune_filename './z_pretrained_weights/resnet_v1_50.ckpt'

After some iterations (650), I get the following error:

loss or weight norm is nan. Training Stopped!

I have looked at issue #15, but it did not help.

I think my dataset representation may be wrong, for example the number of classes.

Is there any way to check whether my custom dataset is in the correct ADE format?

Please help me out. Thanks.

holyseven commented 5 years ago

The number of classes in the 'ADE' dataset is 150, which will cause problems for your customized dataset.

The simplest fix is to change the number of classes in this line of code to 3 (labels 0, 1, 2). Of course, the original ADE dataset can no longer be used once you modify that line.
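
Before changing the class count, it may help to confirm that every annotation image really contains only the values 0, 1 and 2. The following is only a sketch of such a check: the helper names `unexpected_labels` and `check_annotation_file` are made up for illustration and are not part of this repository, and it assumes the annotations are stored as single-channel images that Pillow can read.

```python
import numpy as np


def unexpected_labels(mask, num_classes):
    """Return the sorted label values in `mask` outside the range [0, num_classes)."""
    values = np.unique(np.asarray(mask))
    return [int(v) for v in values if v < 0 or v >= num_classes]


def check_annotation_file(path, num_classes=3):
    """Load one annotation image and report label values that are out of range.

    Assumption: the annotation is a single-channel (grayscale or palette) image.
    """
    # Pillow is imported lazily so the pure-NumPy check above works without it.
    from PIL import Image

    mask = np.array(Image.open(path))
    if mask.ndim != 2:
        raise ValueError(
            "expected a single-channel annotation, got shape %s" % (mask.shape,)
        )
    return unexpected_labels(mask, num_classes)
```

An empty list means the image is consistent with a 3-class setup. Any other result, for example a stray 255, points to the pixel values that would need remapping before training.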

Alternatively, you can write a dedicated reader for your dataset.

One more thing: it is better to use a larger batch size (larger than 8).

lakshmankanakala commented 5 years ago

Thanks for the reply, @holyseven. I followed your suggestion and started training on my custom data, but unfortunately my system shut down after 30,000 iterations.

Can I resume training from the latest checkpoint? I have not found any input argument for resuming. Are there any modifications I need to make? Please help me out.

Thanks.

holyseven commented 5 years ago

To save checkpoints, you can modify the line that sets up the checkpoint saver, together with the argument FLAGS.snapshot.

After that, there is nothing special about loading the latest checkpoint. For example, see https://github.com/hellochick/PSPNet-tensorflow/blob/master/train.py#L189-L196
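
As a rough illustration of the resume step, the sketch below finds the newest checkpoint in a directory and recovers its iteration number, so training could continue from there. It relies only on the file naming that `tf.train.Saver.save(..., global_step=N)` produces (`<name>-N.index`). The function name `find_latest_checkpoint` and the directory `./snapshots` are assumptions for illustration, not code from this repository or from the linked file.

```python
import os
import re

# Matches index files such as "model.ckpt-30000.index".
_STEP_RE = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")


def find_latest_checkpoint(ckpt_dir):
    """Return (checkpoint_prefix, step) for the highest step in ckpt_dir, or None."""
    best = None
    for name in os.listdir(ckpt_dir):
        match = _STEP_RE.match(name)
        if match is None:
            continue  # skip .meta, .data-* and the "checkpoint" state file
        step = int(match.group("step"))
        if best is None or step > best[1]:
            best = (os.path.join(ckpt_dir, match.group("prefix")), step)
    return best


# Hypothetical use inside a TF 1.x training script (names are assumptions):
#   found = find_latest_checkpoint('./snapshots')
#   if found is not None:
#       prefix, start_iter = found
#       saver.restore(sess, prefix)  # saver is a tf.train.Saver
#       # ...continue the training loop from start_iter
```

TensorFlow 1.x also provides `tf.train.latest_checkpoint(checkpoint_dir)`, which returns the same kind of prefix by reading the `checkpoint` state file; the manual scan above is mainly useful when the step number is needed as well.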

lakshmankanakala commented 5 years ago

Thanks for the response. I will do that.

holyseven / PSPNet-TF-Reproduce

Training customised data set. Getting loss or weight norm is nan. Training Stopped! #26