Open · debapriyamaji opened 4 years ago
Hi, thanks for your interest. The bn layer is forked from other repos; please see the readme for each branch. I used two versions of syncbn from two repos: the syncbn in the "citys" branch is from https://github.com/zhanghang1989/PyTorch-Encoding, and the syncbn in "citys-lw" is from https://github.com/CoinCheung/BiSeNet. I'm not sure about the implementation details; I guess it's implemented according to https://hangzhang.org/PyTorch-Encoding/notes/syncbn.html, so maybe you can ask questions in the original repos that provide syncbn.
BTW, could you specify where the difference of "taking the absolute value of the weights and adding eps to it" is? My guess is that `sign(w) * (abs(w) + eps)` is more numerically stable than `w + eps`. When w is negative (assuming w can be negative, with leaky-relu for example), `w + eps` could push w closer to 0, or even change its sign.
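To make the guess concrete, a tiny numerical illustration (the values are arbitrary, not taken from the model):

```python
import math

w, eps = -1e-6, 1e-5

# Adding eps directly can flip the sign of a small negative weight ...
print(w + eps)                          # ≈ 9e-06: sign flipped
# ... while sign(w) * (abs(w) + eps) keeps the sign and bounds |w| away from zero.
print(math.copysign(abs(w) + eps, w))   # ≈ -1.1e-05: sign preserved
```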
Hi Juntang, Thanks for the quick response.
I was referring to this part of the code in inplace_abn_cuda.cu, line 114:
```cpp
template <typename T>
__global__ void forward_kernel(T *x, const T *mean, const T *var, const T *weight, const T *bias,
                               bool affine, float eps, int num, int chn, int sp) {
  int plane = blockIdx.x;

  T _mean = mean[plane];
  T _var = var[plane];
  T _weight = affine ? abs(weight[plane]) + eps : T(1);  // abs(weight) + eps, not weight
  T _bias = affine ? bias[plane] : T(0);

  T mul = rsqrt(_var + eps) * _weight;

  for (int batch = 0; batch < num; ++batch) {
    for (int n = threadIdx.x; n < sp; n += blockDim.x) {
      T _x = x[(batch * chn + plane) * sp + n];
      T _y = (_x - _mean) * mul + _bias;
      x[(batch * chn + plane) * sp + n] = _y;
    }
  }
}
```
Here, `T _weight = affine ? abs(weight[plane]) + eps : T(1);`
This is quite different from PyTorch's batchnorm implementation, where the weight is used without any modification.
With PyTorch's batchnorm, if I apply the exact same operation to the weights before calling batchnorm, or modify the weights while loading the checkpoint, I am able to replicate the accuracy.
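For concreteness, a minimal sketch of the kind of weight remapping I mean, using a plain `nn.BatchNorm2d` in eval mode (the eps and parameter values below are illustrative, not taken from a real checkpoint):

```python
import torch
import torch.nn as nn

eps, C = 1e-5, 8

# Pretend these came from an inplace_abn checkpoint (illustrative values).
abn_weight = torch.randn(C)
abn_bias = torch.randn(C)
abn_mean = torch.randn(C)
abn_var = torch.rand(C) + 0.5

bn = nn.BatchNorm2d(C, eps=eps).eval()
with torch.no_grad():
    bn.weight.copy_(abn_weight.abs() + eps)  # the abs(w) + eps transform from the kernel
    bn.bias.copy_(abn_bias)
    bn.running_mean.copy_(abn_mean)
    bn.running_var.copy_(abn_var)

x = torch.randn(2, C, 4, 4)
y_bn = bn(x)

# What the CUDA kernel above computes at inference time (affine = true).
shape = (1, C, 1, 1)
y_ref = (x - abn_mean.view(shape)) * torch.rsqrt(abn_var.view(shape) + eps) \
        * (abn_weight.abs() + eps).view(shape) + abn_bias.view(shape)

print(torch.allclose(y_bn, y_ref, atol=1e-5))  # True
```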
Regards - Debapriya
@debapriyamaji, I want to run inference of the model on CPU, and replacing the inplace_abn batchnorm with torch batchnorm may be the way to go. Could you please share your modified script, so that it helps me run on CPU? Also, did you try to run the model on CPU? Thanks in advance.
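Not the poster's script, but a rough sketch of how such a replacement could look, assuming the ABN modules expose BatchNorm-style attributes (`num_features`, `eps`, `weight`, `bias`, `running_mean`, `running_var`) and fuse a leaky_relu activation (the inplace_abn default); adjust if the model is configured differently:

```python
import torch
import torch.nn as nn

def replace_abn(module, abn_class, slope=0.01):
    """Recursively swap ABN modules for BatchNorm2d + LeakyReLU for CPU inference."""
    for name, child in module.named_children():
        if isinstance(child, abn_class):
            bn = nn.BatchNorm2d(child.num_features, eps=child.eps)
            with torch.no_grad():
                # Same remapping as above: the kernel effectively uses abs(w) + eps.
                bn.weight.copy_(child.weight.abs() + child.eps)
                bn.bias.copy_(child.bias)
                bn.running_mean.copy_(child.running_mean)
                bn.running_var.copy_(child.running_var)
            setattr(module, name, nn.Sequential(bn, nn.LeakyReLU(slope)))
        else:
            replace_abn(child, abn_class, slope)

# Usage (hypothetical names): replace_abn(model, InPlaceABNSync); model.eval().cpu()
```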
Hi, thanks a lot for sharing the code. I wanted to export an ONNX model, so I replaced all the synchronized batchnorm layers with PyTorch's batchnorm. However, I observed a huge drop in accuracy (~20%). When I dug deeper, I realized that inside the batchnorm kernel, you take the absolute value of the weights and add eps to it. This is functionally different from PyTorch's batchnorm.
What is the reason behind this slightly different implementation of batchnorm? Does it help in training, or is there some other reason?