vacancy / Synchronized-BatchNorm-PyTorch

Synchronized Batch Normalization implementation in PyTorch.
MIT License

Has anyone tested the code in PyTorch 1.0? #24

Closed wadesunyang closed 3 years ago

vacancy commented 5 years ago

Hi @wadesunyang Is there any specific issue I can help with?

Yuliang-Zou commented 5 years ago

Hi @vacancy, I ran all three test files in PyTorch 1.0, and all of them failed the tensor close check.

For example:

F
======================================================================
FAIL: testNumericBatchNorm (__main__.NumericTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_numeric_batchnorm.py", line 51, in testNumericBatchNorm
    self.assertTensorClose(b_var1.data, b_var2.data)
  File "/home/ylzou/research/side_projects/Synchronized-BatchNorm-PyTorch/sync_batchnorm/unittest.py", line 28, in assertTensorClose
    self.assertTrue(torch.allclose(x, y), message)
AssertionError: False is not true : Tensor close check failed
adiff=0.00015270709991455078
rdiff=0.014421123079955578

----------------------------------------------------------------------
Ran 1 test in 0.373s

System info:

Is it related to CUDA version?

vacancy commented 5 years ago

Hi @Yuliang-Zou, the error information you posted is related to numeric stability. It is a known issue that our implementation has poorer numeric stability than the original BN implementation from PyTorch. This does not hurt model performance in our empirical experiments. See also the README for details.
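
For reference, the failing check boils down to a torch.allclose comparison between this implementation and PyTorch's own BN. The sketch below is not taken from the repo's tests; it assumes SynchronizedBatchNorm2d is importable from sync_batchnorm as the README shows, and illustrates how the two modules agree up to a small floating-point error that the default tolerances still flag.

```python
# A minimal sketch (not from the repo's tests), assuming SynchronizedBatchNorm2d
# is importable from sync_batchnorm as shown in the README. It compares the
# single-device forward pass against nn.BatchNorm2d.
import torch
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d

torch.manual_seed(0)
x = torch.randn(16, 10, 8, 8)

bn = nn.BatchNorm2d(10, affine=False)
sync_bn = SynchronizedBatchNorm2d(10, affine=False)

y_ref, y_sync = bn(x), sync_bn(x)

# The outputs differ only by a small floating-point error, but the unit tests
# call torch.allclose with its default tolerances (rtol=1e-5, atol=1e-8),
# which is strict enough to report a failure.
print((y_ref - y_sync).abs().max())
print(torch.allclose(y_ref, y_sync, rtol=1e-3, atol=1e-5))
```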

Yuliang-Zou commented 5 years ago

Thanks for the quick reply! I am training a segmentation model and will compare the performance with original BN.

Venka97 commented 5 years ago

Did you compare the performance @Yuliang-Zou ?

Yuliang-Zou commented 5 years ago

Hi @Venka97, I used it for DeepLabv3 and found the performance slightly worse than using a single GPU with unsynchronized BN, but the difference seems acceptable. I am not sure whether the gap will grow as I increase the number of GPUs.

Venka97 commented 5 years ago

I am training DeepLabv3+ using Sync-BN. I'll try nn.BatchNorm2d as well and report back; it will take some time, though.
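
For anyone setting up the same comparison, the sketch below shows one way to toggle between nn.BatchNorm2d and the synchronized version, wrapping the multi-GPU model with DataParallelWithCallback as the README describes. The `build_deeplab` factory and its `norm_layer` argument are hypothetical placeholders, not part of this repo.

```python
# A hedged sketch of the comparison discussed above. `build_deeplab` and its
# `norm_layer` argument are hypothetical placeholders; SynchronizedBatchNorm2d
# and DataParallelWithCallback are the classes the README documents.
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

def make_model(use_sync_bn, device_ids=(0, 1)):
    norm_layer = SynchronizedBatchNorm2d if use_sync_bn else nn.BatchNorm2d
    model = build_deeplab(norm_layer=norm_layer).cuda()  # hypothetical factory
    if use_sync_bn:
        # The callback-aware DataParallel is what triggers the cross-GPU sync.
        return DataParallelWithCallback(model, device_ids=list(device_ids))
    return nn.DataParallel(model, device_ids=list(device_ids))
```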

ayooshkathuria commented 5 years ago

I see that rdiff is consistently larger than adiff. Is that a red flag? My rdiff even comes out as inf in test_numeric_batchnorm_v2.py (see the note after the output below).

$ python test_numeric_batchnorm.py 

AssertionError: False is not true : Tensor close check failed
adiff=0.00020062923431396484
rdiff=0.005438284948468208
----------------------------------------------------------------------
Ran 1 test in 0.089s

FAILED (failures=1)

$ python test_numeric_batchnorm_v2.py
F
======================================================================
adiff=5.8075333981832955e-06
rdiff=692.86572265625
----------------------------------------------------------------------
$ python test_sync_batchnorm.py 
F...F
======================================================================
FAIL: testSyncBatchNorm2DSyncTrain (__main__.SyncTestCase)
----------------------------------------------------------------------
adiff=0.0001010894775390625
rdiff=17.859498977661133

======================================================================
FAIL: testSyncBatchNormSyncTrain (__main__.SyncTestCase)
----------------------------------------------------------------------
AssertionError: False is not true : Tensor close check failed
adiff=0.00016617774963378906
rdiff=0.013182982802391052

----------------------------------------------------------------------
Ran 5 tests in 11.977s

FAILED (failures=2)
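
A note on the adiff/rdiff question above: assuming adiff is the maximum absolute difference and rdiff divides the difference by the reference values (an assumption about sync_batchnorm/unittest.py, not a quote of it), rdiff blows up whenever the reference tensor contains entries near zero. A huge or even infinite rdiff alongside a tiny adiff is therefore not alarming by itself.

```python
# A hedged sketch of why rdiff can be huge or inf while adiff stays tiny,
# assuming rdiff divides the absolute difference by the reference values
# (an assumption about the repo's unittest helper, not a quote of it).
import torch

def diff_stats(x, ref):
    adiff = (x - ref).abs().max()
    rdiff = ((x - ref).abs() / ref.abs()).max()  # inf when ref has exact zeros
    return adiff.item(), rdiff.item()

ref = torch.tensor([1.0, 1e-6, 0.0])
approx = ref + 1e-4                 # a uniform absolute error of 1e-4
print(diff_stats(approx, ref))      # adiff ~1e-4, rdiff = inf
```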
cannon commented 5 years ago

> Hi @Yuliang-Zou, the error information you posted is related to numeric stability. It is a known issue that our implementation has poorer numeric stability than the original BN implementation from PyTorch. This does not hurt model performance in our empirical experiments. See also the README for details.

Why not adjust atol and rtol passed to torch.allclose to make the tests pass?
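
For concreteness, this suggestion amounts to passing explicit tolerances to the torch.allclose call shown in the traceback above. The helper and the tolerance values below are illustrative placeholders, not something the maintainer endorses.

```python
# A sketch of the suggested change: pass explicit rtol/atol to torch.allclose
# instead of relying on its defaults (rtol=1e-5, atol=1e-8). The helper and
# the tolerance values are illustrative placeholders.
import unittest
import torch

class TorchTestCase(unittest.TestCase):
    def assertTensorClose(self, x, y, rtol=1e-3, atol=1e-5):
        adiff = float((x - y).abs().max())
        message = 'Tensor close check failed\nadiff={}'.format(adiff)
        self.assertTrue(torch.allclose(x, y, rtol=rtol, atol=atol), message)
```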

vacancy commented 3 years ago

Yes, these are known issues, as stated in the README... Since my early script did not check the gradients on the BatchNorm weights, I didn't realize that the relative difference was this large.

Anyway, after almost two years, I finally got some time to check it... I just submitted an issue to PyTorch.

In short, this unexpected gradient mismatch occurs when the output of the BatchNorm layer receives a constant gradient; in that case, the gradient on the BatchNorm weights will be very off...

https://github.com/pytorch/pytorch/issues/53488

It is worth noting, though, that this behavior rarely happens in real-world scenarios, because it occurs only when the output of the BatchNorm layer receives a constant gradient (1 in the case above). In practice, these are gradients on an output feature map, so they won't be constant.
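
The constant-gradient case is easy to reproduce: with an all-ones upstream gradient, the analytic gradients on a BatchNorm layer's input and on its affine weight are exactly zero, so the tiny non-zero values left by floating-point arithmetic can differ enormously between implementations in relative terms. A minimal sketch using plain nn.BatchNorm1d (not this repo's module):

```python
# A minimal sketch of the constant-gradient scenario described above, using
# plain nn.BatchNorm1d rather than this repo's module.
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(10)
x = torch.randn(16, 10, requires_grad=True)

y = bn(x)
y.backward(torch.ones_like(y))  # constant upstream gradient

# Analytically both of these gradients are exactly zero; numerically they are
# only tiny, so relative comparisons between two implementations blow up.
print(x.grad.abs().max())
print(bn.weight.grad.abs().max())
```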