I tried TensorFlow 1.8 and 1.12, with both Python 2 and Python 3, and all of them work fine. See the log below.
Please also check your copies of the dataset (http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) and the pretrained model (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz).
python ./train.py --batch_size 2 --gpu_num 4 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --train_max_iter 60000 --snapshot 30000 --random_rotate 0 --database 'ADE' --train_image_size 480 --test_image_size 480 --network 'resnet_v1_101' --fine_tune_filename './z_pretrained_weights/resnet_v1_101.ckpt'
GPU devices: 0,1,2,3
{'batch_size': 2, 'blur': 1, 'bn_frozen': 0, 'color_switch': 0, 'consider_dilated': 0, 'data_format': 'NHWC', 'database': 'ADE', 'eval_only': 0, 'fine_tune_filename': './z_pretrained_weights/resnet_v1_101.ckpt', 'float_type': 32, 'gpu_num': 4, 'has_aux_loss': 1, 'initializer': 'he', 'loss_type': 'normal', 'lr_step': None, 'lrn_rate': 0.01, 'mirror': 1, 'momentum': 0.9, 'network': 'resnet_v1_101', 'new_layer_names': None, 'optimizer': 'mom', 'poly_lr': 1, 'random_rotate': 0, 'random_scale': 1, 'resume_step': None, 'save_first_iteration': 0, 'scale_max': 2.0, 'scale_min': 0.5, 'snapshot': 30000, 'step_size': 0.1, 'structure_in_paper': 0, 'subsets_for_training': 'train', 'test_image_size': 480, 'test_max_iter': None, 'train_image_size': 480, 'train_like_in_paper': 0, 'train_max_iter': 60000, 'weight_decay_mode': 0, 'weight_decay_rate': 0.0001, 'weight_decay_rate2': 0.0001}
< using tf.float32 >
Database has 20210 images.
applying random mirror ...
applying random scale [0.500000, 2.000000]...
< Resnet structure >
num_residual_units: [3, 4, 23, 3]
rates in each atrous convolution: [1, 1, 2, 4]
stride in each block: [1, 2, 1, 1]
channels in each block: [256, 512, 1024, 2048]
shape after pool1: (2, 120, 120, 64)
shape after block 1: (2, 120, 120, 256)
shape after block 2: (2, 60, 60, 512)
aux_logits: (2, 60, 60, 256)
upsampled auxiliary_x for loss function: (2, 480, 480, 150)
shape after block 3: (2, 60, 60, 1024)
pool6 pooled size: (2, 6, 6, 512)
pool6 output size: (2, 60, 60, 512)
pool3 pooled size: (2, 3, 3, 512)
pool3 output size: (2, 60, 60, 512)
pool2 pooled size: (2, 2, 2, 512)
pool2 output size: (2, 60, 60, 512)
pool1 pooled size: (2, 1, 1, 512)
pool1 output size: (2, 60, 60, 512)
shape after block 4: (2, 60, 60, 512)
logits: (2, 60, 60, 512)
logits after upsampling: (2, 480, 480, 150)
normal cross entropy with softmax ...
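For reference, the pooled and output sizes logged above (pool6 down to pool1) follow the standard pyramid pooling pattern: the 60x60 feature map is average-pooled to 6x6, 3x3, 2x2, and 1x1 grids, each branch is projected to 512 channels with a 1x1 convolution, and then resized back to 60x60. A minimal sketch of that step, with my own naming rather than the repository's actual code:

```python
import tensorflow as tf

def psp_module(x, levels=(6, 3, 2, 1), branch_channels=512):
    """Hypothetical pyramid pooling sketch matching the logged shapes.

    x is an NHWC feature map, e.g. (2, 60, 60, 1024) after block 3; its
    spatial size must be divisible by every pooling level.
    """
    h, w = x.get_shape().as_list()[1:3]
    branches = []
    for level in levels:
        # Average-pool the map into a level x level grid (6x6, 3x3, 2x2, 1x1).
        stride = (h // level, w // level)
        pooled = tf.layers.average_pooling2d(x, pool_size=stride, strides=stride)
        # 1x1 conv to 512 channels, cf. "pool6 pooled size: (2, 6, 6, 512)".
        reduced = tf.layers.conv2d(pooled, branch_channels, 1,
                                   activation=tf.nn.relu)
        # Bilinear resize back, cf. "pool6 output size: (2, 60, 60, 512)".
        branches.append(tf.image.resize_bilinear(reduced, (h, w)))
    # Concatenate the input with all pyramid branches along channels.
    return tf.concat([x] + branches, axis=-1)
```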
< weight decay info >
Applying L2 regularization...
============================================
=============== LogDir Info ================
log_dir ./log
database_dir ./log/ADE
exp_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1
snapshot_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1/snapshot
=============== LogDir Info ================
============================================
2019-01-29 09:56:40.924629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2019-01-29 09:56:42.236680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-29 09:56:42.236729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 1 2 3
2019-01-29 09:56:42.236753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N Y Y Y
2019-01-29 09:56:42.236759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1: Y N Y Y
2019-01-29 09:56:42.236762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2: Y Y N Y
2019-01-29 09:56:42.236766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3: Y Y Y N
< Finetuning Process: not import resnet_v1_101/block3/unit_24/weights:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/biases:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/weights:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/logits/biases:0 >
< Successfully loaded fine-tune model from ./z_pretrained_weights/resnet_v1_101.ckpt. >
< training process begins >
2019-01-29 09:57:55.321205 39990] Step 20, lr = 0.009997, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 5.5795, precision = 0.0129, wd = 0.6102
estimated time left: 0.0 hours. 20/60000
2019-01-29 09:58:05.562760 39990] Step 40, lr = 0.009994, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 4.4038, precision = 0.0200, wd = 0.6112
estimated time left: 8.5 hours. 40/60000
2019-01-29 09:58:15.895888 39990] Step 60, lr = 0.009991, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 3.7097, precision = 0.0241, wd = 0.6118
estimated time left: 8.6 hours. 60/60000
2019-01-29 09:58:26.273976 39990] Step 80, lr = 0.009988, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 3.5920, precision = 0.0270, wd = 0.6120
estimated time left: 8.6 hours. 80/60000
2019-01-29 09:58:36.568903 39990] Step 100, lr = 0.009985, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 3.6517, precision = 0.0302, wd = 0.6123
estimated time left: 8.6 hours. 100/60000
2019-01-29 09:58:46.962572 39990] Step 120, lr = 0.009982, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 3.5043, precision = 0.0301, wd = 0.6124
estimated time left: 8.6 hours. 120/60000
2019-01-29 09:58:57.264132 39990] Step 140, lr = 0.009979, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 3.5823, precision = 0.0321, wd = 0.6126
estimated time left: 8.6 hours. 140/60000
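As a side note, the learning rates in the log are consistent with the poly schedule selected by 'poly_lr': 1 in the settings above, i.e. lr = lrn_rate * (1 - step / train_max_iter) ** power. A quick check, assuming the conventional power of 0.9 (the power itself is not printed in the log):

```python
# Reproduce the logged learning rates under an assumed poly power of 0.9.
base_lr, max_iter, power = 0.01, 60000, 0.9

for step in (20, 40, 60, 80, 100, 120, 140):
    lr = base_lr * (1.0 - float(step) / max_iter) ** power
    print('step %3d: lr = %.6f' % (step, lr))
# step  20: lr = 0.009997
# step  40: lr = 0.009994
# ... matching the values in the log above.
```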
I re-downloaded the ADE dataset and it works! However, I find that batch_size * gpu_num should indeed not be too small: my training still fails occasionally with --batch_size 2 --gpu_num 4 on a Tesla K80.
By the way, I find it takes more than 4 minutes of preprocessing before the first training iteration actually starts. I wonder whether most of that time is spent on the multi-GPU mechanism?
Usually it takes more time (2-3 minutes on my machine) than a single-GPU task. I think TensorFlow needs time to create the forward and backward graphs, set up GPU-GPU and CPU-GPU communication, etc. I don't know of a way to accelerate graph creation; update me if you have some ideas, or directly open a pull request.
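If you want to see where the startup time actually goes, one option is to time graph construction and session initialization separately. A minimal sketch, with a toy per-GPU tower standing in for the real ResNet-101 (the layer and loop here are illustrative, not this repository's code):

```python
import time
import tensorflow as tf

NUM_GPUS = 4  # illustrative, matching --gpu_num 4

t0 = time.time()
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        # Toy tower in place of the real forward pass.
        x = tf.random_normal([2, 480, 480, 3])
        y = tf.layers.conv2d(x, 64, 3, name='conv', reuse=(i > 0))
        loss = tf.reduce_mean(tf.square(y))
        # Building gradients replicates the backward graph per tower,
        # which is one reason multi-GPU graphs take longer to construct.
        tf.gradients(loss, tf.trainable_variables())
print('graph construction took %.1f s' % (time.time() - t0))

t0 = time.time()
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
print('session creation + init took %.1f s' % (time.time() - t0))
```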
About the NaN error, however: even when I use --batch_size 1 --gpu_num 4, there is no NaN error, at least for the first 100 iterations (repeated 5 times). I am not sure what is happening; let me know if you have any thoughts.
I also tried the same setting on another server (Tesla M40) and it still worked fine for about 200 iterations. I will study these issues further.
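When a NaN only shows up sporadically, one way to localize it (my suggestion, not something the training script already does) is TensorFlow's built-in numeric checking, which raises an error naming the first op that produces an Inf or NaN. A self-contained toy example:

```python
import tensorflow as tf

# Toy graph standing in for the training loss; log(x) yields NaN for
# negative inputs, mimicking a sporadic numerical failure.
x = tf.placeholder(tf.float32, shape=[None])
loss = tf.reduce_mean(tf.log(x))

# Insert a check after every floating-point op in the graph.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run([loss, check_op], feed_dict={x: [1.0, 2.0]})   # runs fine
    sess.run([loss, check_op], feed_dict={x: [1.0, -2.0]})  # raises, names the op
```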
Thanks for the code and the reply!
I tried to train on the ADE dataset, but I still met the error reported in #15. There are two differences from the example script (3.b):

1. I used `--batch_size 2 --gpu_num 4` because of GPU memory limitations, but I decreased the `--lrn_rate` to `0.00001` as suggested in #15 (see the batch-norm note below).
2. I used the `resnet_v1_101` network with `resnet_v1_101.ckpt` as the pretrained model.

My TensorFlow version is 1.8.0. Any idea about this error? Thanks!
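One note on the small-batch setting in item 1: batch-norm statistics estimated from only 1-2 images per GPU are very noisy and are a common cause of NaN losses; the settings dump above includes a bn_frozen flag, which suggests the code can freeze batch norm. For reference, a minimal TF 1.x sketch of the usual freezing pattern (the names here are hypothetical, not this repository's):

```python
import tensorflow as tf

def conv_bn_relu(x, filters, is_training, freeze_bn=True):
    """Conv + BN + ReLU; with freeze_bn=True the BN layer keeps using
    the pretrained moving mean/variance instead of the noisy statistics
    of a tiny per-GPU batch."""
    x = tf.layers.conv2d(x, filters, 3, padding='same', use_bias=False)
    x = tf.layers.batch_normalization(x, training=is_training and not freeze_bn)
    return tf.nn.relu(x)
```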