I ran the experiments without regularization, using a 2*8 batch size, random_rotate 0, image_size 816, and 30K iterations, and got 75.5 mIoU (ss); with L2 and L2-SP I got around 76.7 mIoU. (The performance for L2 and L2-SP differs from the results in the table; the numbers in that table may be wrong and I will correct them later, but that is not the issue here.)
I cannot answer your question with certainty. My guess: with one GPU (batch size 8?), you probably trained the model for the same number of iterations (30K?). Compared with two GPUs (batch size 8*2), and setting aside the side effects on the batch normalization layers, reducing the batch size at a fixed iteration count is roughly equivalent to early stopping. Since regularization restrains the optimization of the real loss function, the L2 or L2-SP training process may not converge very well under this "hypothetically equal" early stopping.
On the other hand (and of this I am fairly sure), regularization can help improve the state-of-the-art performance.
For the L2-SP hyper-parameters on Cityscapes, my experience is that `weight_decay_rate2` can usually be larger than or equal to `weight_decay_rate`. For example, what I set for L2-SP is `--weight_decay_rate 0.0001 --weight_decay_rate2 0.001`, or `--weight_decay_rate 0.001 --weight_decay_rate2 0.001`. Both of these settings give good results.
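Roughly, the two penalties differ like this (just a sketch for illustration, not the exact code of this repo; I am assuming here that `weight_decay_mode 1` selects L2-SP, with `weight_decay_rate` weighting the SP term on the pre-trained layers and `weight_decay_rate2` the plain L2 term on the new layers):

```python
# Sketch of L2 vs. L2-SP regularization (illustration only, not the repo's code).
# `pretrained` maps variable names to the starting-point weights w0 loaded
# from the fine-tuning checkpoint; variables absent from it are new layers.
import tensorflow as tf

def regularization_loss(variables, pretrained, wd_mode, wd_rate, wd_rate2):
    penalty = 0.0
    for v in variables:
        if wd_mode == 1 and v.op.name in pretrained:
            # L2-SP: pull the weights towards the pre-trained values w0
            # instead of towards zero.
            penalty += wd_rate * tf.nn.l2_loss(v - pretrained[v.op.name])
        elif wd_mode == 1:
            # New layers have no w0, so they get a plain L2 penalty,
            # weighted by wd_rate2 (usually >= wd_rate).
            penalty += wd_rate2 * tf.nn.l2_loss(v)
        else:
            # wd_mode == 0: ordinary L2 weight decay on every variable.
            penalty += wd_rate * tf.nn.l2_loss(v)
    return penalty
```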
Got it! Thank you very much!
I am getting `nan` loss. Can you help me? I have used ResNet-50 as in the link you shared.
@root-sudip See if this link can help you (https://github.com/holyseven/PSPNet-TF-Reproduce/issues/19#issuecomment-458462816). Just for the details, could you tell me your GPU type and the script you used to run the training?
Feel free to open a new issue if existing ones do not match your problem.
@holyseven I am using an NVIDIA Tesla P6 GPU and Python 3 to train the network.
To get the segmentation masks, I have used your code. It works fine at first, but after around 1k iterations I start getting `nan` loss.
Thanks @holyseven for the reply.
@root-sudip Which database and which hyper-parameters were you using? Could you give me something like this:
```
python ./run.py --network 'resnet_v1_50' --visible_gpus '0,1' --reader_method 'queue' --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --database 'ADE' --subsets_for_training 'train' --batch_size 8 --train_image_size 480 --snapshot 10000 --train_max_iter 60000 --test_image_size 480 --fine_tune_filename './z_pretrained_weights/resnet_v1_50.ckpt'
```
@holyseven Sure, here it is. I used the Cityscapes dataset to train the network.

```
python3 ./run.py --network 'resnet_v1_50' --visible_gpus '0' --reader_method 'queue' --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --database 'Cityscapes' --subsets_for_training 'train' --batch_size 8 --train_image_size 480 --snapshot 10000 --train_max_iter 60000 --test_image_size 480 --fine_tune_filename './z_pretrained_weights/resnet_v1_50.ckpt'
```
@root-sudip I repeated your training process but didn't hit the `nan` problem. Although `train_image_size` and `test_image_size` should be set larger for Cityscapes, that is not a problem for training itself. This is what I got:
```
2019-03-17 10:39:42.496338 79650] Step 1320, lr = 0.009802, wd_mode = 0, wd_rate = 0.000100, wd_rate_2 = 0.000100
loss = 0.38199526, aux_loss = 0.16923133, weight_decay = 1.6719286, Select_1 = 0.29464543,
Estimated time left: 8.62 hours. 1320/60000
```
Try these:
Ok, I am doing the training again.
About the database, let me confirm the setup with you here.
Suppose there is a gt file `aachen_000000_000019_gtFine_polygons.json`, and alongside it I have three `.png` files (for each corresponding gt `.json` file) in `database/cityscapes/gt/train/*/`, as you suggest in your code. For example:
1) `aachen_000000_000019_gtFine_color.png`
2) `aachen_000000_000019_gtFine_instanceIds.png`
3) `aachen_000000_000019_gtFine_labelIds.png`
These are the files for `aachen_000000_000019_gtFine_polygons.json`, and they are all in `database/cityscapes/gt/train/*/`.
So, to take only the color masks, I changed line 36 in your code (`database/reader.py`) to:

```python
labels_filename_proto = data_dir + '/gt/' + data_sub + '/*/*_color.png'
```

I used this line to pick up only the `*_color.png` images. Am I right @holyseven?
It seems that you didn't run `createTrainIdLabelImgs`. (This link is the correct one for semantic segmentation; the link in my previous comment is for instance segmentation.) The `*_color.png` files store RGB colors rather than class IDs, so reading them as labels gives invalid class indices, which would explain the `nan` loss. `createTrainIdLabelImgs` generates a fourth `.png` file, ending with `*_labelTrainIds.png`, and that is the one used for training.
Verify the generated files, and then change line 36 of `database/reader.py` to:

```python
labels_filename_proto = data_dir + '/gt/' + data_sub + '/*/*_labelTrainIds.png'
```
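If you want a quick sanity check before retraining, something like this works (my own snippet, not part of this repo; valid Cityscapes train IDs are 0-18, with 255 marking ignored pixels):

```python
# Sanity-check the generated *_labelTrainIds.png files: every pixel
# should be a train ID in 0..18 or the ignore value 255.
import glob
import numpy as np
from PIL import Image

valid_ids = set(range(19)) | {255}
for path in glob.glob('database/cityscapes/gt/train/*/*_labelTrainIds.png'):
    ids = set(np.unique(np.array(Image.open(path))).tolist())
    if not ids <= valid_ids:
        print('Unexpected label values in', path, ':', sorted(ids - valid_ids))
```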
Ok, can I mail you a generated `*_labelTrainIds.png` image to verify? (to your hotmail id)
OK. Feel free to open a new issue if existing ones do not match your problem.
Ok, I am sending you a file, and if I face the same problem again I will open an issue. Thanks @holyseven!
I trained ResNet-50 on Cityscapes with one GPU three times, differing only in the weight decay strategy; the other parameters are the same as in your example. With L2-SP regularization, the precision on the val set is 69.93 mIoU. With L2 regularization, it is 69.68 mIoU. With no regularization, it is 72.15 mIoU. So, my question is: does the loss regularization really work? Why does the highest accuracy occur with no regularization? Or how can I change the hyper-parameters to improve the accuracy of L2-SP regularization?