poppinace / indexnet_matting

(ICCV'19) Indices Matter: Learning to Index for Deep Image Matting

freeze_bn seems to be an invalid option #35

Open wangruohui opened 2 years ago

wangruohui commented 2 years ago

Dear author,

I am trying to read and reproduce your code, but I found a possible issue with batch normalization.

In the current code, you define a freeze_bn() function that switches all batch normalization layers to eval mode:

https://github.com/poppinace/indexnet_matting/blob/4beb06a47db2eecca87b8003a11f0b268506cea3/scripts/hlmobilenetv2.py#L824

However, you neither override the train() function of nn.Module nor call freeze_bn() again before each training cycle.

This means that when the training script calls net.train(), these BN layers switch back to training mode, so freeze_bn() effectively has no effect and all training is conducted with BN enabled. Is this right?
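
For context, here is a minimal, self-contained sketch of what I mean (the toy model and the local freeze_bn helper below are illustrative, not the repository's code): nn.Module.train() recursively puts every submodule back into training mode, so a one-off freeze_bn() call made before it is undone.

```python
import torch.nn as nn

# Toy model standing in for the real network (illustrative only).
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

def freeze_bn(model):
    # Put every BN layer into eval mode so its running stats stop updating.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()

freeze_bn(net)
print(net[1].training)  # False -- BN is frozen

net.train()             # the usual call at the start of a training loop
print(net[1].training)  # True -- the freeze has been undone
```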

poppinace commented 2 years ago

Hi, net.train() does not affect the eval mode of the BN layers if they are set using the freeze_bn() function in the code. You can check the status of the BN layers during training.
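
For instance, a check like the following (an illustrative snippet, not code from the training script) can be run between iterations:

```python
import torch.nn as nn

def report_bn(model):
    # Print whether each BN layer is in training mode and a few
    # running-mean values, so changes between iterations are visible.
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            print(name, 'training =', m.training,
                  'running_mean[:4] =', m.running_mean[:4].tolist())
```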

wangruohui commented 2 years ago

Hello,

Thanks for your quick response. I did some checking, but the results show that BN is in training mode.

I forked your repository and added some code to check the running mean/var of BN during training, as here.

However, with the current implementation, the results show that the running mean is still being updated during training, which means BN is in training mode.

module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.65694, frame: 1.21Hz/1.21Hz
module.layer0.1.running_mean tensor([ 0.0068, -0.0150,  0.0079,  0.0414, -0.0621,  0.1761, -0.0140, -0.0060,
         0.0165,  0.0259,  0.0073,  0.1927, -0.2286, -0.0602,  0.0484, -0.0455,
         0.0065,  0.0189, -0.1496, -0.0099, -0.0214, -0.0488,  0.0089,  0.0518,
         0.0236, -0.0177, -0.0556, -0.1153, -0.0569,  0.0842, -0.0276,  0.0240],
       device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.46729, frame: 2.69Hz/1.95Hz
module.layer0.1.running_mean tensor([ 0.0101, -0.0301,  0.0155,  0.0893, -0.1588,  0.2140, -0.0270, -0.0085,
         0.0359,  0.0477,  0.0153,  0.3849, -0.4547, -0.0492,  0.0886, -0.0884,
         0.0118,  0.0393, -0.2979, -0.0194, -0.0458, -0.0328,  0.0181,  0.0438,
         0.0500, -0.0376, -0.1109, -0.2188, -0.1717,  0.1630, -0.0560,  0.0464],
       device='cuda:0')
epoch: 1, train: 3/10775, loss: 0.38356, frame: 2.57Hz/2.16Hz
module.layer0.1.running_mean tensor([ 0.0122, -0.0392,  0.0198,  0.1183, -0.2181,  0.2257, -0.0345, -0.0102,
         0.0471,  0.0608,  0.0198,  0.5007, -0.5906, -0.0325,  0.1124, -0.1140,
         0.0146,  0.0517, -0.3863, -0.0254, -0.0606, -0.0288,  0.0238,  0.0453,
         0.0662, -0.0498, -0.1439, -0.2812, -0.2416,  0.2102, -0.0730,  0.0595],
       device='cuda:0')
epoch: 1, train: 4/10775, loss: 0.33935, frame: 2.75Hz/2.31Hz

If I comment out net.train(), these variables remain constant:

alchemy start...
module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.64172, frame: 1.20Hz/1.20Hz
module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.45836, frame: 2.70Hz/1.95Hz

Would you please have a look at this? Or did I miss something?
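
For reference, one possible workaround on my side (a sketch, not the repository's code) would be to re-freeze BN after every net.train() call, e.g. by wrapping the network and overriding train():

```python
import torch.nn as nn

class FrozenBNWrapper(nn.Module):  # hypothetical wrapper, for illustration only
    def __init__(self, net):
        super().__init__()
        self.net = net

    def train(self, mode=True):
        super().train(mode)
        # Re-apply the freeze so BN stays in eval mode even after .train().
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
        return self

    def forward(self, x):
        return self.net(x)
```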

poppinace commented 2 years ago

OK, have you fixed the issue? Do you see improved performance with frozen BN?