According to the original paper:

> (In architecture search) we always use batch-specific statistics for batch normalization rather than the global moving average. Learnable affine parameters in all batch normalizations are disabled during the search process.
The second statement holds true in this implementation, but I couldn't find the corresponding code for the first statement. From what I can see, all batch normalization layers use the global moving average.
For example, one batch norm layer has this form:

```python
BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=False)
```

I think `momentum` (which is really a decay factor on the batch statistics) should be set to 1 here to be consistent with the paper, so that the running statistics always equal the most recent batch's statistics.
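To illustrate why `momentum=1` would have that effect, here is a minimal sketch of PyTorch's update rule for running statistics (the layer size and input shape below are illustrative, not taken from this repository):

```python
import torch
import torch.nn as nn

# PyTorch updates running stats as:
#   running_mean = (1 - momentum) * running_mean + momentum * batch_mean
# so with momentum=1 the running statistics are overwritten by the
# statistics of the latest batch on every forward pass in train mode.
bn = nn.BatchNorm2d(32, eps=1e-05, momentum=1.0, affine=False)
bn.train()

x = torch.randn(8, 32, 4, 4)  # dummy batch
bn(x)

batch_mean = x.mean(dim=(0, 2, 3))  # per-channel mean over N, H, W
print(torch.allclose(bn.running_mean, batch_mean, atol=1e-5))  # True
```

An alternative worth noting: constructing the layer with `track_running_stats=False` makes batch normalization use batch statistics unconditionally (even in eval mode), which may be an even more direct match for "batch-specific statistics".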