Closed AntoineRichard closed 6 years ago
Hi,
I looked into your issue and did not find any problem on my side. Assuming you downloaded the train and validation data and the resnet checkpoint files, you should be able to train from scratch without problems.
Thanks.
Hi,
Sorry for the trouble. I retried on a fresh install of TensorFlow 1.7, CUDA 9.0, and cuDNN 7.0, and it looks like it works now.
Thanks !
Hi,
I ran your code with TensorFlow 1.7.0 on your data (downloaded from the MediaFire link) and tried to retrain the network using resnet_v2_50. I get NaN losses when testing. I think it has something to do with the is_training placeholder:
- If set to False, the output of the network (the softmax, possibly the "logits_tf" op) is a 513x513x21 matrix filled with NaN values.
- If set to True, the output is fine.
I am training on a GTX 1080Ti, and I used the following command:
python train.py --starting_learning_rate=0.00001 --batch_norm_decay=0.997 --crop_size=513 --gpu_id=0 --resnet_model=resnet_v2_50
Do you know why this happens? It feels batch-norm related, as is often the case with TensorFlow. I also tried TensorFlow 1.8, and it did not work either.
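To show why this symptom would be consistent with a batch-norm issue, here is a minimal pure-Python sketch. It is not the repository's code and does not confirm the actual cause. The function `batch_norm` and the NaN moving statistics are hypothetical, chosen only to illustrate that training mode normalizes with the current batch statistics, while inference mode relies on stored moving averages, so corrupted moving averages would only show up when is_training is False.

```python
import math


def batch_norm(batch, moving_mean, moving_var, is_training, eps=1e-3):
    """Normalize a list of floats, mimicking the two batch-norm modes.

    Hypothetical helper for illustration only (no learned scale/offset).
    """
    if is_training:
        # Training mode: use the statistics of the current batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Inference mode: use the stored moving averages instead.
        mean, var = moving_mean, moving_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


batch = [0.5, 1.5, -2.0, 3.0]

# Assumption for the sake of the example: the stored moving statistics
# are corrupted (NaN), e.g. by a bad checkpoint or a broken install.
bad_mean, bad_var = float("nan"), float("nan")

train_out = batch_norm(batch, bad_mean, bad_var, is_training=True)
test_out = batch_norm(batch, bad_mean, bad_var, is_training=False)

print(any(math.isnan(v) for v in train_out))  # False: batch stats are used
print(all(math.isnan(v) for v in test_out))   # True: moving stats are used
```

If something like this is going on, checking the moving mean and variance variables for NaN after restoring the checkpoint would be a reasonable first step.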
Thanks in advance,
Antoine