Loss or weight nan error on ADE dataset

lcybuzz commented 5 years ago

I tried to train on the ADE dataset, but I still hit the error reported in #15. There are two differences from the example script (3.b):

I used --batch_size 2 --gpu_num 4 because of GPU memory limitations, and decreased --lrn_rate to 0.00001 as suggested in #15.
I used the resnet_v1_101 network with resnet_v1_101.ckpt as the pretrained model.

My TensorFlow version is 1.8.0. Any idea what causes this error? Thanks!

holyseven commented 5 years ago

I tried TensorFlow 1.8 and 1.12, with both Python 2 and Python 3, and every combination works fine. See the log below.

Check the dataset (http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) and the pretrained model (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz).
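One way to check the downloaded dataset archive before unpacking it is to verify the CRC of every member. This is a minimal sketch using only the Python standard library; the function name `find_corrupt_member` and the local file path in the usage comment are assumptions for illustration and are not part of this repository.

```python
import zipfile


def find_corrupt_member(zip_path):
    """Return the name of the first member whose CRC check fails, or None.

    Raises ValueError if the file is not a readable zip archive at all,
    which is what a truncated download often looks like.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError("not a readable zip archive: %s" % zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() reads every member and returns the first bad name, or None.
        return archive.testzip()


# Usage (the local path is an assumption):
# bad = find_corrupt_member("ADEChallengeData2016.zip")
# print("archive looks intact" if bad is None else "corrupt member: %s" % bad)
```

A clean result only shows that the archive was transferred intact; it does not validate the individual images or labels.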

python ./train.py --batch_size 2 --gpu_num 4 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --train_max_iter 60000 --snapshot 30000 --random_rotate 0 --database 'ADE' --train_image_size 480 --test_image_size 480 --network 'resnet_v1_101' --fine_tune_filename './z_pretrained_weights/resnet_v1_101.ckpt'
GPU devices:  0,1,2,3
{'batch_size': 2, 'blur': 1, 'bn_frozen': 0, 'color_switch': 0, 'consider_dilated': 0, 'data_format': 'NHWC', 'database': 'ADE', 'eval_only': 0, 'fine_tune_filename': './z_pretrained_weights/resnet_v1_101.ckpt', 'float_type': 32, 'gpu_num': 4, 'has_aux_loss': 1, 'initializer': 'he', 'loss_type': 'normal', 'lr_step': None, 'lrn_rate': 0.01, 'mirror': 1, 'momentum': 0.9, 'network': 'resnet_v1_101', 'new_layer_names': None, 'optimizer': 'mom', 'poly_lr': 1, 'random_rotate': 0, 'random_scale': 1, 'resume_step': None, 'save_first_iteration': 0, 'scale_max': 2.0, 'scale_min': 0.5, 'snapshot': 30000, 'step_size': 0.1, 'structure_in_paper': 0, 'subsets_for_training': 'train', 'test_image_size': 480, 'test_max_iter': None, 'train_image_size': 480, 'train_like_in_paper': 0, 'train_max_iter': 60000, 'weight_decay_mode': 0, 'weight_decay_rate': 0.0001, 'weight_decay_rate2': 0.0001}

< using tf.float32 >

Database has 20210 images.
applying random mirror ...
applying random scale [0.500000, 2.000000]...

< Resnet structure >

num_residual_units:  [3, 4, 23, 3]
rates in each atrous convolution:  [1, 1, 2, 4]
stride in each block:  [1, 2, 1, 1]
channels in each block:  [256, 512, 1024, 2048]
shape after pool1:  (2, 120, 120, 64)
shape after block 1:  (2, 120, 120, 256)
shape after block 2:  (2, 60, 60, 512)
aux_logits:  (2, 60, 60, 256)
upsampled auxiliary_x for loss function:  (2, 480, 480, 150)
shape after block 3:  (2, 60, 60, 1024)
pool6 pooled size:  (2, 6, 6, 512)
pool6 output size:  (2, 60, 60, 512)
pool3 pooled size:  (2, 3, 3, 512)
pool3 output size:  (2, 60, 60, 512)
pool2 pooled size:  (2, 2, 2, 512)
pool2 output size:  (2, 60, 60, 512)
pool1 pooled size:  (2, 1, 1, 512)
pool1 output size:  (2, 60, 60, 512)
shape after block 4:  (2, 60, 60, 512)
logits:  (2, 60, 60, 512)
logits after upsampling:  (2, 480, 480, 150)
normal cross entropy with softmax ... 

< weight decay info >

Applying L2 regularization...
============================================
=============== LogDir Info ================
log_dir ./log
database_dir ./log/ADE
exp_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1
snapshot_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1/snapshot
=============== LogDir Info ================
============================================

2019-01-29 09:56:40.924629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2019-01-29 09:56:42.236680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-29 09:56:42.236729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 
2019-01-29 09:56:42.236753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y 
2019-01-29 09:56:42.236759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y 
2019-01-29 09:56:42.236762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y 
2019-01-29 09:56:42.236766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N 
< Finetuning Process: not import resnet_v1_101/block3/unit_24/weights:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/biases:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/weights:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/logits/biases:0 >
< Succesfully loaded fine-tune model from ./z_pretrained_weights/resnet_v1_101.ckpt. >

< training process begins >

2019-01-29 09:57:55.321205 39990] Step 20, lr = 0.009997, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 5.5795, precision = 0.0129, wd = 0.6102
     estimated time left: 0.0 hours. 20/60000
2019-01-29 09:58:05.562760 39990] Step 40, lr = 0.009994, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 4.4038, precision = 0.0200, wd = 0.6112
     estimated time left: 8.5 hours. 40/60000
2019-01-29 09:58:15.895888 39990] Step 60, lr = 0.009991, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 3.7097, precision = 0.0241, wd = 0.6118
     estimated time left: 8.6 hours. 60/60000
2019-01-29 09:58:26.273976 39990] Step 80, lr = 0.009988, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 3.5920, precision = 0.0270, wd = 0.6120
     estimated time left: 8.6 hours. 80/60000
2019-01-29 09:58:36.568903 39990] Step 100, lr = 0.009985, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 3.6517, precision = 0.0302, wd = 0.6123
     estimated time left: 8.6 hours. 100/60000
2019-01-29 09:58:46.962572 39990] Step 120, lr = 0.009982, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 3.5043, precision = 0.0301, wd = 0.6124
     estimated time left: 8.6 hours. 120/60000
2019-01-29 09:58:57.264132 39990] Step 140, lr = 0.009979, wd_rate = 0.000100, wd_rate_2 = 0.000100 
     loss = 3.5823, precision = 0.0321, wd = 0.6126
     estimated time left: 8.6 hours. 140/60000
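The lr values in the log above decrease slowly from the base rate of 0.01, which fits the poly_lr = 1 setting in the printed configuration. A minimal sketch of such a schedule follows, assuming the common form base_lr * (1 - step / max_iter) ** power. The exponent 0.9 and the function name `poly_lr` are assumptions, not taken from the repository; 0.9 is used here because it matches the logged values to six decimals.

```python
def poly_lr(base_lr, step, max_iter, power=0.9):
    """Polynomial learning-rate decay (power=0.9 is an assumed value)."""
    return base_lr * (1.0 - float(step) / max_iter) ** power


# Compare against the steps printed in the log above.
for step in (20, 60, 100, 140):
    print("step %d: lr = %.6f" % (step, poly_lr(0.01, step, 60000)))
```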

lcybuzz commented 5 years ago

I re-downloaded the ADE dataset and it works! However, I find that batch_size * gpu_num should indeed not be too small. My training still fails occasionally with --batch_size 2 --gpu_num 4 on a Tesla K80.

By the way, preprocessing takes more than 4 minutes before the first training iteration actually starts. Is most of that time spent on the multi-GPU mechanism?

holyseven commented 5 years ago

A multi-GPU run usually takes longer to start than a single-GPU task (2~3 minutes on my machine). I think TensorFlow needs the time to build the forward and backward graph and to set up GPU-GPU and CPU-GPU communication, etc. I don't know whether there is a way to accelerate graph creation. Let me know if you have any ideas, or open a pull request directly.
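To see where the startup time actually goes, each phase can be timed separately. The sketch below is a generic Python 3 helper, not code from this repository; the name `timed` and the phase label are hypothetical, and the `sum(...)` call merely stands in for real work such as graph construction.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, durations):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[label] = time.perf_counter() - start


durations = {}
with timed("graph_build", durations):
    sum(range(100000))  # stand-in for building the multi-GPU graph

for label, seconds in durations.items():
    print("%s: %.3f s" % (label, seconds))
```

Wrapping graph construction, session creation, checkpoint loading, and the first training step in separate blocks would show which one dominates.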

As for the NaN error, even when I use --batch_size 1 --gpu_num 4, I get no NaN error for at least the first 100 iterations (repeated 5 times). I am not sure what is happening. Let me know if you have any thoughts.
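When the failure is intermittent, it helps to record the exact step at which the loss first becomes NaN or Inf. This is a minimal sketch in plain Python, assuming the loss values have already been fetched as floats; the function name `first_non_finite` is hypothetical and not part of this repository.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# The first three values are taken from the log above; the last is an
# artificial NaN added for illustration.
history = [5.5795, 4.4038, 3.7097, float("nan")]
print(first_non_finite(history))
```

TensorFlow 1.x also provides tf.check_numerics, which raises an error when a tensor contains NaN or Inf and can help narrow down which tensor diverges first.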

lcybuzz commented 5 years ago

I also tried the same settings on another server (Tesla M40), and training ran fine there for about 200 iterations. I will look into these issues further.

Thanks for the code and the reply!

holyseven / PSPNet-TF-Reproduce

Loss or weight nan error on ADE dataset #19