[Open] wangruohui opened this issue 2 years ago
Dear author,

I am trying to read and reproduce your code, but I found a possible issue with batch normalization.

In the current code, you define a `freeze_bn()` function to switch all batch normalization layers to `eval` mode, like
https://github.com/poppinace/indexnet_matting/blob/4beb06a47db2eecca87b8003a11f0b268506cea3/scripts/hlmobilenetv2.py#L824

But you neither override the `train` method of `nn.Module` nor call this function before every training cycle. This means that when the `train()` function calls `net.train()`, these BN layers switch back to training mode, so `freeze_bn` actually takes no effect and all training is conducted with BN enabled. Is this right?
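The behavior described above can be reproduced with a minimal sketch (the `freeze_bn` below is illustrative, not the repository's exact code): `nn.Module.train()` recursively resets every submodule, including BN layers, to training mode, undoing an earlier freeze.

```python
import torch.nn as nn

# A tiny model with one BN layer.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
)

def freeze_bn(model):
    """Put every BatchNorm2d layer into eval mode (sketch of the pattern)."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()

freeze_bn(net)
print(net[1].training)  # False: BN is frozen

net.train()             # nn.Module.train() recursively resets all children
print(net[1].training)  # True: the freeze has been undone
```

So unless `freeze_bn()` is re-applied after every `net.train()` call, the BN layers keep updating their running statistics.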
Hi, `net.train()` does not affect the `eval` mode of the BN layers if it is set using the `freeze_bn()` function in the code. You can check the status of the BN layers during training.
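One way to do the check suggested above (a sketch; the helper name is illustrative) is to print each BN layer's `training` flag and running statistics inside the training loop. If `running_mean` changes between iterations, the layer is still in training mode.

```python
import torch.nn as nn

def report_bn(model):
    """Print the mode and running mean of every BatchNorm2d layer.

    A changing running_mean across iterations means the layer is
    still updating its statistics, i.e. it is in training mode.
    """
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            print(name, "training =", m.training,
                  "running_mean =", m.running_mean[:4])

# Hypothetical use inside a training loop:
# net.train()
# for batch in loader:
#     ...
#     report_bn(net)
```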
Hello,

Thanks for your quick response. I did some checking, but the results show that BN is in training mode.

I made a fork of your repository and added some code to check the running mean/var of BN during training, as here. However, with the current implementation, the results show that the running mean is still being updated during training, which means BN is in training mode.
```
module.layer0.1.running_mean tensor([ 0.0036, -0.0038, 0.0019, 0.0096, -0.0123, 0.0645, -0.0039, -0.0033,
0.0043, 0.0067, 0.0018, 0.0469, -0.0557, -0.0310, 0.0132, -0.0124,
0.0022, 0.0046, -0.0369, -0.0028, -0.0050, -0.0080, 0.0019, 0.0060,
0.0052, -0.0040, -0.0138, -0.0289, -0.0096, 0.0213, -0.0068, 0.0069],
device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.65694, frame: 1.21Hz/1.21Hz
module.layer0.1.running_mean tensor([ 0.0068, -0.0150, 0.0079, 0.0414, -0.0621, 0.1761, -0.0140, -0.0060,
0.0165, 0.0259, 0.0073, 0.1927, -0.2286, -0.0602, 0.0484, -0.0455,
0.0065, 0.0189, -0.1496, -0.0099, -0.0214, -0.0488, 0.0089, 0.0518,
0.0236, -0.0177, -0.0556, -0.1153, -0.0569, 0.0842, -0.0276, 0.0240],
device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.46729, frame: 2.69Hz/1.95Hz
module.layer0.1.running_mean tensor([ 0.0101, -0.0301, 0.0155, 0.0893, -0.1588, 0.2140, -0.0270, -0.0085,
0.0359, 0.0477, 0.0153, 0.3849, -0.4547, -0.0492, 0.0886, -0.0884,
0.0118, 0.0393, -0.2979, -0.0194, -0.0458, -0.0328, 0.0181, 0.0438,
0.0500, -0.0376, -0.1109, -0.2188, -0.1717, 0.1630, -0.0560, 0.0464],
device='cuda:0')
epoch: 1, train: 3/10775, loss: 0.38356, frame: 2.57Hz/2.16Hz
module.layer0.1.running_mean tensor([ 0.0122, -0.0392, 0.0198, 0.1183, -0.2181, 0.2257, -0.0345, -0.0102,
0.0471, 0.0608, 0.0198, 0.5007, -0.5906, -0.0325, 0.1124, -0.1140,
0.0146, 0.0517, -0.3863, -0.0254, -0.0606, -0.0288, 0.0238, 0.0453,
0.0662, -0.0498, -0.1439, -0.2812, -0.2416, 0.2102, -0.0730, 0.0595],
device='cuda:0')
epoch: 1, train: 4/10775, loss: 0.33935, frame: 2.75Hz/2.31Hz
```
If I comment out `net.train()`, these variables stay constant, like:
```
alchemy start...
module.layer0.1.running_mean tensor([ 0.0036, -0.0038, 0.0019, 0.0096, -0.0123, 0.0645, -0.0039, -0.0033,
0.0043, 0.0067, 0.0018, 0.0469, -0.0557, -0.0310, 0.0132, -0.0124,
0.0022, 0.0046, -0.0369, -0.0028, -0.0050, -0.0080, 0.0019, 0.0060,
0.0052, -0.0040, -0.0138, -0.0289, -0.0096, 0.0213, -0.0068, 0.0069],
device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.64172, frame: 1.20Hz/1.20Hz
module.layer0.1.running_mean tensor([ 0.0036, -0.0038, 0.0019, 0.0096, -0.0123, 0.0645, -0.0039, -0.0033,
0.0043, 0.0067, 0.0018, 0.0469, -0.0557, -0.0310, 0.0132, -0.0124,
0.0022, 0.0046, -0.0369, -0.0028, -0.0050, -0.0080, 0.0019, 0.0060,
0.0052, -0.0040, -0.0138, -0.0289, -0.0096, 0.0213, -0.0068, 0.0069],
device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.45836, frame: 2.70Hz/1.95Hz
```
Could you please take a look at this? Or did I miss something?
OK. Have you fixed the issue? Do you see improved performance with frozen BN?
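A common way to make the freeze stick (not the repository's code, just a sketch of the usual pattern) is to override `train()` on a wrapper module so that BN layers are re-frozen every time training mode is enabled:

```python
import torch.nn as nn

class FrozenBNNet(nn.Module):
    """Wrapper whose train() keeps BatchNorm layers in eval mode."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)

    def train(self, mode=True):
        super().train(mode)          # set all submodules first
        if mode:                     # then re-freeze the BN layers
            for m in self.modules():
                if isinstance(m, nn.BatchNorm2d):
                    m.eval()
        return self
```

With such an override, calling `net.train()` at the start of each epoch no longer re-enables BN statistics updates, so a one-off `freeze_bn()` call cannot be silently undone.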