Hi @vacancy, I ran all three test files with PyTorch 1.0, and all of them failed the tensor-close check. For example:
F
======================================================================
FAIL: testNumericBatchNorm (__main__.NumericTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_numeric_batchnorm.py", line 51, in testNumericBatchNorm
self.assertTensorClose(b_var1.data, b_var2.data)
File "/home/ylzou/research/side_projects/Synchronized-BatchNorm-PyTorch/sync_batchnorm/unittest.py", line 28, in assertTensorClose
self.assertTrue(torch.allclose(x, y), message)
AssertionError: False is not true : Tensor close check failed
adiff=0.00015270709991455078
rdiff=0.014421123079955578
----------------------------------------------------------------------
Ran 1 test in 0.373s
System info:
Is it related to the CUDA version?
Hi @Yuliang-Zou The error information you posted is related to numeric stability. It is a known issue that our implementation has poorer numeric stability than the original BN implementation from PyTorch. This does not hurt model performance in our empirical experiments. See also the README for details.
Thanks for the quick reply! I am training a segmentation model and will compare the performance with original BN.
Did you compare the performance, @Yuliang-Zou?
Hi @Venka97, I used it for DeepLabv3 and found that the performance is slightly worse than training on a single GPU with unsynchronized BN, but the difference seems acceptable. I'm not sure whether the gap will grow as I increase the number of GPUs.
I am training DeepLabv3+ using Sync-BN. I'll also try nn.BatchNorm2d and report back, though it will take some time.
I see that rdiff is consistently larger than adiff. Is that a red flag? My rdiff value is inf in the test test_numeric_batchnorm_v2.py.
$ python test_numeric_batchnorm.py
AssertionError: False is not true : Tensor close check failed
adiff=0.00020062923431396484
rdiff=0.005438284948468208
----------------------------------------------------------------------
Ran 1 test in 0.089s
FAILED (failures=1)
$ python test_numeric_batchnorm_v2.py
F
======================================================================
adiff=5.8075333981832955e-06
rdiff=692.86572265625
----------------------------------------------------------------------
$ python test_sync_batchnorm.py
F...F
======================================================================
FAIL: testSyncBatchNorm2DSyncTrain (__main__.SyncTestCase)
----------------------------------------------------------------------
adiff=0.0001010894775390625
rdiff=17.859498977661133
======================================================================
FAIL: testSyncBatchNormSyncTrain (__main__.SyncTestCase)
----------------------------------------------------------------------
AssertionError: False is not true : Tensor close check failed
adiff=0.00016617774963378906
rdiff=0.013182982802391052
----------------------------------------------------------------------
Ran 5 tests in 11.977s
FAILED (failures=2)
> Hi @Yuliang-Zou The error information you posted is related to numeric stability. It is a known issue that our implementation has poorer numeric stability than the original BN implementation from PyTorch. This does not hurt model performance in our empirical experiments. See also the README for details.
Why not adjust atol and rtol passed to torch.allclose to make the tests pass?
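Something along these lines, for instance (the tolerance values below are purely illustrative, not a recommendation):

```python
import torch

x = torch.tensor([1.0000, 2.0002])
y = torch.tensor([1.0001, 2.0000])

# The default tolerances (rtol=1e-5, atol=1e-8) flag this pair as "not close" ...
print(torch.allclose(x, y))                        # False
# ... while loosened, illustrative tolerances let the comparison pass.
print(torch.allclose(x, y, rtol=1e-3, atol=1e-4))  # True
```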
Yes, these are known issues, as stated in the README... Since my early script did not check the gradients on the BatchNorm weights, I didn't realize that the relative difference could be this large.
Anyway, after almost two years, I finally got some time to check it... I just submitted an issue to PyTorch.
In short, this unexpected gradient mismatch occurs when, for an input x of shape [B, C, H, W], you compute

x = x - x.mean(dim=(0, 2, 3), keepdim=True)   # mean over the B, H, W dimensions
y = gamma * x + beta
loss = y.sum()

In this case the gradient will be very far off...
https://github.com/pytorch/pytorch/issues/53488
But it is worth noting that this behavior rarely happens in real-world scenarios, because it occurs only when the output of the batch norm layer receives a constant gradient (1 in the case above). In practice, the incoming gradients on the output feature map are not constant.
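To see why the relative difference explodes (rdiff can even be inf), here is a small standalone sketch of the scenario above; it is my own illustration, not one of the repo's tests. With loss = y.sum(), the analytic gradient of gamma is exactly zero for every channel, so any floating-point residue yields a huge relative error even though the absolute error stays tiny.

```python
import torch

B, C, H, W = 16, 8, 32, 32
x = torch.randn(B, C, H, W)
gamma = torch.randn(C, requires_grad=True)
beta = torch.randn(C, requires_grad=True)

# Subtract the per-channel mean over B, H, W, then apply the affine transform.
x_centered = x - x.mean(dim=(0, 2, 3), keepdim=True)
y = gamma.view(1, C, 1, 1) * x_centered + beta.view(1, C, 1, 1)

# A plain sum gives every element of y a constant gradient of 1,
# which is exactly the situation described above.
y.sum().backward()

# Analytically, gamma.grad should be all zeros (the centered activations sum
# to zero per channel); numerically it is only approximately zero, so the
# relative difference against the exact value blows up.
print(gamma.grad)   # small, non-zero values
print(beta.grad)    # B * H * W for every channel
```

This is consistent with the earlier test output, where adiff stays around 1e-6 while rdiff is huge.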
Hi @wadesunyang Is there any specific issue I can help with?