NaN losses - Githubissues

AntoineRichard commented 6 years ago

Hi,

I ran your code with TensorFlow 1.7.0 on your data (downloaded from the MediaFire link) and tried to retrain the network with resnet_v2_50, but I get NaN losses when testing. I think it has something to do with the is_training placeholder:

- If set to False, the output of the network (the softmax, which I believe is the "logits_tf" op) is a 513x513x21 matrix filled with NaN values.
- If set to True, the output is fine.
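A quick way to confirm the symptom described above is to count the NaN entries in the fetched output. The snippet below is only a minimal sketch in plain Python: `count_nans` and the tiny `output` list are hypothetical stand-ins for the real 513x513x21 result, and with NumPy available a call such as `np.isnan(output).any()` would be the more usual check.

```python
import math


def count_nans(values):
    """Recursively count NaN entries in a nested list/tuple of floats."""
    if isinstance(values, (list, tuple)):
        return sum(count_nans(v) for v in values)
    return 1 if isinstance(values, float) and math.isnan(values) else 0


# Hypothetical stand-in for a network output (height x width x classes).
output = [
    [[0.1, 0.9], [float("nan"), 0.5]],
    [[0.3, 0.7], [float("nan"), float("nan")]],
]

nan_total = count_nans(output)
print("NaN entries:", nan_total)
```

Running the same check once with is_training set to True and once with it set to False would show whether only the inference path is affected.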

I am training on a GTX 1080Ti, and I used the following command:

python train.py --starting_learning_rate=0.00001 --batch_norm_decay=0.997 --crop_size=513 --gpu_id=0 --resnet_model=resnet_v2_50

Do you know why it does that? It feels batch-norm related, as is often the case with TensorFlow. I also tried TensorFlow 1.8, and it did not work either.
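To illustrate why batch normalization could, in principle, behave differently depending on is_training, here is a minimal sketch in plain Python. It is not the repository's code and not a diagnosis of this issue: `batch_norm`, `EPSILON`, and the corrupted moving statistics are assumptions made purely for illustration. In training mode the layer normalizes with the statistics of the current batch, while in inference mode it relies on stored moving averages, so bad stored values would only show up when is_training is False.

```python
import math

EPSILON = 1e-5  # assumed small constant for numerical stability


def batch_norm(batch, moving_mean, moving_var, is_training):
    """Normalize one feature over a batch (no learned scale or shift)."""
    if is_training:
        # Training mode: use the statistics of the current batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Inference mode: use the stored moving statistics.
        mean, var = moving_mean, moving_var
    return [(x - mean) / math.sqrt(var + EPSILON) for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]

# Hypothetical corrupted moving statistics, for example from a bad restore.
bad_mean, bad_var = float("nan"), float("nan")

train_out = batch_norm(batch, bad_mean, bad_var, is_training=True)
infer_out = batch_norm(batch, bad_mean, bad_var, is_training=False)

print("NaN in training mode: ", any(math.isnan(v) for v in train_out))
print("NaN in inference mode:", any(math.isnan(v) for v in infer_out))
```

In this toy setup the training-mode output is finite and the inference-mode output is entirely NaN, which matches the shape of the symptom, though the actual cause here may be something else.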

Thanks in advance,

Antoine

sthalles commented 6 years ago

Hi,

I took a look at your issue and did not find anything wrong. Assuming you downloaded the training and validation data and the ResNet checkpoint files, you should be able to train from scratch with no problems.

Thanks.

AntoineRichard commented 6 years ago

Hi,

Sorry for the trouble. I retried on a fresh install of TensorFlow 1.7, CUDA 9.0, and cuDNN 7.0, and it looks like it works now.

Thanks!

sthalles / deeplab_v3

NaN losses #36