hujie-frank / SENet

Squeeze-and-Excitation Networks
Apache License 2.0

The effect of lr_mult and decay_mult on accuracy #26

Closed 408550969 closed 7 years ago

408550969 commented 7 years ago

Excuse me, what difference in accuracy would it make if I didn't add param { lr_mult, decay_mult } when training? What are the default values of lr_mult and decay_mult in Caffe?

408550969 commented 7 years ago

Sorry to bother you. This is my ResNet-34 train.prototxt. Why did my accuracy drop after I added
param { lr_mult: 1.0 decay_mult: 1.0 } param { lr_mult: 2.0 decay_mult: 0 } ?

name: "ResNet-34"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "/media/cll/Seagate/ilsvrc12_train_lmdb"
    batch_size: 32
    backend: LMDB
  }
}
layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 64
    kernel_size: 7
    pad: 3
    stride: 2
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0.0 }
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  scale_param { bias_term: true }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "conv1_relu"
  type: "ReLU"
}
layer {
  bottom: "conv1"
  top: "pool1"
  name: "pool1"
  type: "Pooling"
  pooling_param { pool: MAX kernel_size: 3 stride: 2 }
}

hujie-frank commented 7 years ago

The default value for both lr_mult and decay_mult is 1. If a convolutional layer is followed by a BatchNorm layer, you can remove the bias term in the convolutional layer with bias_term: false. And you'd better use the following setting in the Scale layer:

param {
     lr_mult: 1
     decay_mult: 0
}
param {
     lr_mult: 1
     decay_mult: 0
}
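
Putting this advice together, a Conv + BatchNorm + Scale stack with these settings would look roughly like the following sketch (based on the conv1 block above):

layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }   # weights only, since the bias is removed
  convolution_param {
    num_output: 64
    kernel_size: 7
    pad: 3
    stride: 2
    weight_filler { type: "msra" }
    bias_term: false                   # redundant before BatchNorm
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  param { lr_mult: 0 decay_mult: 0 }   # mean
  param { lr_mult: 0 decay_mult: 0 }   # variance
  param { lr_mult: 0 decay_mult: 0 }   # moving-average factor
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  param { lr_mult: 1 decay_mult: 0 }   # gamma: learned, but not regularized
  param { lr_mult: 1 decay_mult: 0 }   # beta: learned, but not regularized
  scale_param { bias_term: true }
}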
408550969 commented 7 years ago

Thanks! Should the convolutional layer look like this?

layer {
  bottom: "res5c_branch2a"
  top: "res5c_branch2b"
  name: "res5c_branch2b"
  type: "Convolution"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 512
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_term: false
  }
}

hujie-frank commented 7 years ago

You should remove the second param { ... } as the bias term is removed.

408550969 commented 7 years ago

Just like this?

layer {
  bottom: "res5c_branch2a"
  top: "res5c_branch2b"
  name: "res5c_branch2b"
  type: "Convolution"
  convolution_param {
    num_output: 512
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_term: false
  }
}

hujie-frank commented 7 years ago

That's all right.

408550969 commented 7 years ago

May I ask why, if a convolutional layer is followed by a BatchNorm layer, we should set bias_term: false? Could you give me some more information? And why should we add param { lr_mult: 1 decay_mult: 0 } param { lr_mult: 1 decay_mult: 0 } in the Scale layer? Thank you very much!

hujie-frank commented 7 years ago

The BatchNorm layer subtracts the mean value for each channel. If there is a bias term in the preceding convolutional layer, it will simply be subtracted away by the following BN layer. In that case, the bias term becomes useless.

The Scale layer is used to adjust the normalized distribution produced by the BatchNorm layer. In general, we do not include its parameters in weight regularization, which is why decay_mult is set to 0.
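
A small worked derivation shows why the bias cancels, writing the per-channel output of the convolution as $Wx + b$:

$$\mu = \mathrm{E}[Wx + b] = \mathrm{E}[Wx] + b \quad\Longrightarrow\quad (Wx + b) - \mu = Wx - \mathrm{E}[Wx].$$

Whatever value $b$ takes, it is removed by the mean subtraction, and the bias $\beta$ of the Scale layer takes over its role.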

408550969 commented 7 years ago

Thanks! Some people use use_global_stats in type: "BatchNorm", while in other versions some people use param { lr_mult: 0 decay_mult: 0 } param { lr_mult: 0 decay_mult: 0 } param { lr_mult: 0 decay_mult: 0 }. May I ask what the difference is? Which is better?

hujie-frank commented 7 years ago

In the test stage, use_global_stats should be true so that the statistics accumulated during training are used. In the training stage, use_global_stats should be false so that those statistics are generated for testing. More details will be described here.
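
A rough prototxt sketch of this setup (the layer names are only illustrative):

# Training phase: normalize with statistics of the current mini-batch
# and accumulate the moving averages for later use.
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  include { phase: TRAIN }
  batch_norm_param { use_global_stats: false }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}
# Test phase: normalize with the accumulated global statistics.
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  include { phase: TEST }
  batch_norm_param { use_global_stats: true }
}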

408550969 commented 7 years ago

But why, on ImageNet, is my test accuracy higher with false than with true?

hujie-frank commented 7 years ago

If use_global_stats is false at test time, the result depends on the input batch size, and in most cases it is worse than with true. As for your case, I think there is something wrong with the BatchNorm layers in the training stage.

408550969 commented 7 years ago

Yes, I found the problem. When the input batch size is 90, the accuracy is 3% higher than when it is 16. Could you check my train.prototxt? Thanks! 1.docx

hujie-frank commented 7 years ago

You forgot the following setting in the BatchNorm layer.

param {
     lr_mult: 0
     decay_mult: 0
}
param {
     lr_mult: 0
     decay_mult: 0
}
param {
     lr_mult: 0
     decay_mult: 0
}

And you also forgot the following setting in the Scale layer.

param {
     lr_mult: 1
     decay_mult: 0
}
param {
     lr_mult: 1
     decay_mult: 0
}

408550969 commented 7 years ago

Let me have a try. Thanks a lot!

408550969 commented 7 years ago

The accuracy dropped, and the accuracy with false is still higher than with true.

408550969 commented 7 years ago

I found the problem: at the initial stage of training, the accuracy with false was higher than with true, but by the end of training, true was higher than false. Now my ResNet-34 accuracy can reach 71%, although it is still not as good as the paper.

408550969 commented 6 years ago

I am sorry to disturb you again and again, but I have a few more questions. Why do some layers have two param blocks, like param { lr_mult: 1 decay_mult: 0 } param { lr_mult: 1 decay_mult: 0 }, while some layers have three? The default value for lr_mult and decay_mult is 1, but how many param blocks are there by default, and what do these param blocks mean? And why should decay_mult be set to 0 in the Scale layer? Thank you very much!

hujie-frank commented 6 years ago

The number of param blocks configured for a layer should equal the number of parameters in that layer.
lr_mult multiplied by the base learning rate gives the actual learning rate of that parameter, and likewise decay_mult multiplied by weight_decay gives the actual weight decay of that parameter. The BatchNorm layer has three parameters: the mean, the variance, and the moving-average factor. All of them are updated by the moving-average strategy described in the BN paper, rather than by gradient descent. For the Scale layer, we do not apply weight regularization to its parameters, as suggested by the BN paper.
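
Put as formulas, with purely illustrative numbers:

$$\mathrm{lr}_{\mathrm{param}} = \mathrm{base\_lr} \times \mathrm{lr\_mult}, \qquad \mathrm{decay}_{\mathrm{param}} = \mathrm{weight\_decay} \times \mathrm{decay\_mult}.$$

For example, with a base_lr of 0.1 and lr_mult of 2, the parameter is updated with an effective learning rate of 0.2; with a decay_mult of 0 it receives no weight decay at all.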

408550969 commented 6 years ago

Thanks! I have some other questions. What do the two parameters in the Scale layer mean (are they the mean, variance and factor?)? In MobileNet, they add param { lr_mult: 1 decay_mult: 1 } in the Convolution layer, and bias_term is set to false. What is the meaning of this param block? And since the default value for lr_mult and decay_mult is 1, is it possible to remove this param block?

hujie-frank commented 6 years ago

@408550969 Maybe you should read the BN paper again. In BVLC/caffe, batch normalization consists of a BatchNorm layer followed by a Scale layer. If each convolutional layer is followed by a BN layer, the bias term of the preceding convolutional layer is almost useless, since the BN layer subtracts the mean for each channel (i.e., whatever biases are given, they will eventually be subtracted away). If the number of param{} blocks for a layer in the prototxt is less than the actual number of parameters, caffe fills in the missing ones with the default param { lr_mult: 1 decay_mult: 1 }.
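
For example, in this hypothetical Scale layer only one param block is written, although the layer (with bias_term: true) has two parameters:

layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  param { lr_mult: 1 decay_mult: 0 }   # explicit: applies to the scale (gamma)
  scale_param { bias_term: true }      # the bias (beta) falls back to the default
}

Here caffe fills in the missing block with the default param { lr_mult: 1 decay_mult: 1 }, so the bias would still be subject to weight decay unless a second param block is written explicitly.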

408550969 commented 6 years ago

Thank you!

unclejokerjoker commented 5 years ago

The default value for both lr_mult and decay_mult is 1. If a convolutional layer is followed by a BatchNorm layer, you can remove the bias term in the convolutional layer with bias_term: false. And you'd better use the following setting in the Scale layer.

param {
     lr_mult: 1
     decay_mult: 0
}
param {
     lr_mult: 1
     decay_mult: 0
}

Why should we set lr_mult and decay_mult in the Scale layer this way? I think in the Scale layer we should set both params to 0: param { lr_mult: 0 decay_mult: 0 } param { lr_mult: 0 decay_mult: 0 }, because the params in the Scale layer are not affected by gradient descent. Are there any differences between setting them to 1 and 0?