This is to make our learning rate '#gpu irrelevant'. Specifically, gradients are summed over GPUs without rescaling, so the effective learning rate is cfg.TRAIN.lr * num_gpu. If, on the contrary, we used the batch size as the value of rescale_grad, we would have to set cfg.TRAIN.lr = 2e-3 for 4 GPUs and cfg.TRAIN.lr = 1e-3 for 8 GPUs explicitly in the yaml file.
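To make that arithmetic concrete, here is a minimal plain-Python sketch; the per-GPU gradient value is a placeholder, and 5e-4 is just the default lr mentioned later in this thread, so the numbers are illustrative only:

```python
num_gpu = 8
per_gpu_grad = 1.0                    # pretend each GPU produced this gradient
summed_grad = num_gpu * per_gpu_grad  # the kvstore sums gradients over GPUs

lr = 5e-4

# rescale_grad = 1.0: the update is lr * summed_grad, so the effective
# learning rate is lr * num_gpu and the yaml lr never needs to change
# with the number of GPUs.
step_unscaled = lr * 1.0 * summed_grad              # 4e-3 with 8 GPUs

# rescale_grad = 1 / batch_size: the sum is averaged back down, so the
# yaml lr itself would have to be retuned for each GPU count.
step_rescaled = lr * (1.0 / num_gpu) * summed_grad  # 5e-4 regardless of GPUs

print(step_unscaled, step_rescaled)
```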
Thank you for the immediate reply. I now want to reproduce your COCO experiment, but I only have 2 GPUs, while your default yaml file assumes 8. What should I do to get similar results? Perhaps I should keep rescale_grad = 1.0, increase the learning rate from 5e-4 to 2e-3, and change the batch size from 1 image/GPU to 4 images/GPU (how can I implement this in MXNet? I know Caffe has iter_size for this). Is there anything I have missed?
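Regarding the iter_size part of the question: one way to approximate Caffe-style gradient accumulation in MXNet is with Gluon, by setting grad_req='add' and stepping the trainer only every few forward/backward passes. The tiny Dense network, random data, and accumulate count below are placeholders, not FCIS code; this is only a sketch of the pattern:

```python
import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(10)                      # stand-in for the real network
net.initialize(mx.init.Xavier(), ctx=mx.cpu())

# 'add' makes backward() accumulate into the gradient buffers
# instead of overwriting them.
net.collect_params().setattr('grad_req', 'add')

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 5e-4})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

accumulate = 4  # images per "virtual" batch, mimicking Caffe's iter_size
for step in range(2):
    for _ in range(accumulate):
        x = mx.nd.random.uniform(shape=(1, 100))
        y = mx.nd.array([1])
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()              # gradients accumulate across passes
    # step(1) applies the summed gradients without averaging them,
    # mirroring rescale_grad = 1.0 in the symbolic training code.
    trainer.step(1)
    for param in net.collect_params().values():
        param.zero_grad()            # reset before the next virtual batch
```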
Just using lr = 5e-4 and 1 image/GPU should produce a similar result.
I tried the method as you suggested, only changing gpu=0,1. However, I observe an mAP about 3% lower than your paper reports, so I wonder whether I should adjust the lr and batch_size to improve my result.
Why is the parameter rescale_grad fixed to 1.0 (https://github.com/msracver/FCIS/blob/master/fcis/train_end2end.py#L159)? I think this parameter should be kept consistent with input_batch_size; am I right?
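For context on what "fixed to 1.0" means mechanically, below is a minimal, hypothetical illustration of how rescale_grad is passed into an MXNet Module's optimizer. The toy symbol, shapes, and hyperparameters are placeholders and not the FCIS training code:

```python
import mxnet as mx

# Toy symbol standing in for the detection network (purely illustrative).
data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')
fc = mx.sym.FullyConnected(data, num_hidden=10)
net = mx.sym.SoftmaxOutput(fc, label, name='softmax')

mod = mx.mod.Module(net, context=mx.cpu())   # training would use a list of GPUs
mod.bind(data_shapes=[('data', (1, 100))], label_shapes=[('softmax_label', (1,))])
mod.init_params(mx.init.Xavier())

# rescale_grad=1.0 means the per-GPU gradient sum from the kvstore is applied
# as-is, so the effective learning rate grows with the number of GPUs.
mod.init_optimizer(optimizer='sgd',
                   optimizer_params={'learning_rate': 5e-4,
                                     'momentum': 0.9,
                                     'rescale_grad': 1.0})
```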