vacancy / Synchronized-BatchNorm-PyTorch

Synchronized Batch Normalization implementation in PyTorch.

Can't test successfully using the scripts in ./tests #20

Closed: Yochengliu closed this issue 5 years ago

Yochengliu commented 5 years ago

Thanks for this excellent project, but I'm having trouble running the tests successfully.

1. First, I ran the scripts in ./tests; the errors are as follows:

(1) test_numeric_batchnorm.py

```
ERROR: testNumericBatchNorm (__main__.NumericTestCase)
Traceback (most recent call last):
  File "test_numeric_batchnorm.py", line 48, in testNumericBatchNorm
    self.assertTensorClose(bn.running_mean, a.mean(dim=0))
  File "/home/liuyongcheng/3dcls/scannet/embed/scancls_embed13/syncbn/unittest.py", line 28, in assertTensorClose
    self.assertTrue(torch.allclose(x, y), message)
AttributeError: module 'torch' has no attribute 'allclose'

Ran 1 test in 0.192s

FAILED (errors=1)
```
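For context: `torch.allclose` was only added in PyTorch 0.4.0, so it is missing on the PyTorch 0.3.1 listed in the environment below. If one wanted to run the tests without upgrading, a minimal fallback with NumPy-style closeness semantics could be patched into `unittest.py`; this is a sketch, not code from this repo:

```python
import torch

def allclose(x, y, rtol=1e-5, atol=1e-8):
    # NumPy-style closeness test: every element must satisfy
    # |x - y| <= atol + rtol * |y|. Works on old PyTorch tensors
    # that predate torch.allclose.
    return bool(((x - y).abs() <= atol + rtol * y.abs()).all())
```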

(2) test_numeric_batchnorm_v2.py

```
ERROR: testNumericBatchNorm (__main__.NumericTestCasev2)
Traceback (most recent call last):
  File "test_numeric_batchnorm_v2.py", line 33, in testNumericBatchNorm
    batchnorm2 = BatchNorm2dReimpl(CHANNELS, momentum=1)
  File "/home/liuyongcheng/3dcls/scannet/embed/scancls_embed13/syncbn/batchnorm_reimpl.py", line 33, in __init__
    self.weight = nn.Parameter(torch.empty(num_features))
AttributeError: module 'torch' has no attribute 'empty'

Ran 1 test in 0.001s

FAILED (errors=1)
```
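Similarly, `torch.empty` first appeared in PyTorch 0.4.0. On 0.3.x the closest equivalent is the legacy uninitialized-tensor constructor; a hedged sketch of the substitution (the variable names here are illustrative, not the repo's):

```python
import torch
from torch import nn

num_features = 10  # hypothetical channel count, standing in for CHANNELS

# torch.Tensor(n) allocates an uninitialized 1-D FloatTensor of length n,
# which is what torch.empty(n) does on PyTorch >= 0.4.
weight = nn.Parameter(torch.Tensor(num_features))
```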

(3) test_sync_batchnorm.py

```
ERROR: testSyncBatchNorm2DSyncTrain (__main__.SyncTestCase)
Traceback (most recent call last):
  File "test_sync_batchnorm.py", line 107, in testSyncBatchNorm2DSyncTrain
    self._checkBatchNormResult(bn, sync_bn, torch.rand(16, 10, 16, 16), True, cuda=True)
  File "test_sync_batchnorm.py", line 59, in _checkBatchNormResult
    output2.sum().backward()
  File "/home/liuyongcheng/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/liuyongcheng/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

ERROR: testSyncBatchNormNormalEval (__main__.SyncTestCase)
Traceback (most recent call last):
  File "test_sync_batchnorm.py", line 77, in testSyncBatchNormNormalEval
    self._checkBatchNormResult(bn, sync_bn, torch.rand(16, 10), False)
  File "test_sync_batchnorm.py", line 61, in _checkBatchNormResult
    self.assertTensorClose(input1.data, input2.data)
  File "/home/liuyongcheng/3dcls/scannet/embed/scancls_embed13/syncbn/unittest.py", line 28, in assertTensorClose
    self.assertTrue(torch.allclose(x, y), message)
AttributeError: module 'torch' has no attribute 'allclose'

ERROR: testSyncBatchNormNormalTrain (__main__.SyncTestCase)
Traceback (most recent call last):
  File "test_sync_batchnorm.py", line 71, in testSyncBatchNormNormalTrain
    self._checkBatchNormResult(bn, sync_bn, torch.rand(16, 10), True)
  File "test_sync_batchnorm.py", line 61, in _checkBatchNormResult
    self.assertTensorClose(input1.data, input2.data)
  File "/home/liuyongcheng/3dcls/scannet/embed/scancls_embed13/syncbn/unittest.py", line 28, in assertTensorClose
    self.assertTrue(torch.allclose(x, y), message)
AttributeError: module 'torch' has no attribute 'allclose'

ERROR: testSyncBatchNormSyncEval (__main__.SyncTestCase)
Traceback (most recent call last):
  File "test_sync_batchnorm.py", line 97, in testSyncBatchNormSyncEval
    self._checkBatchNormResult(bn, sync_bn, torch.rand(16, 10), False, cuda=True)
  File "test_sync_batchnorm.py", line 61, in _checkBatchNormResult
    self.assertTensorClose(input1.data, input2.data)
  File "/home/liuyongcheng/3dcls/scannet/embed/scancls_embed13/syncbn/unittest.py", line 28, in assertTensorClose
    self.assertTrue(torch.allclose(x, y), message)
AttributeError: module 'torch' has no attribute 'allclose'

ERROR: testSyncBatchNormSyncTrain (__main__.SyncTestCase)
Traceback (most recent call last):
  File "test_sync_batchnorm.py", line 87, in testSyncBatchNormSyncTrain
    self._checkBatchNormResult(bn, sync_bn, torch.rand(16, 10), True, cuda=True)
  File "test_sync_batchnorm.py", line 59, in _checkBatchNormResult
    output2.sum().backward()
  File "/home/liuyongcheng/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/liuyongcheng/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

Ran 5 tests in 6.546s

FAILED (errors=5)
```

2. Second, I ran my own script with `net = DataParallelWithCallback(net, device_ids=[0, 1])` on two GPUs (a single GPU works fine); the error is:

```
Traceback (most recent call last):
  File "train_scan_em13.py", line 190, in <module>
    loss.backward()
  File "/home/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.
```
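For reference, the call above does match this repo's intended usage pattern: `DataParallelWithCallback` re-runs a replication callback on every module copy after each forward, which the synchronized BN layers need to set up their inter-GPU communication and which plain `nn.DataParallel` does not do. A minimal self-contained sketch, assuming PyTorch >= 0.4, two visible GPUs, the `sync_batchnorm` package from this repo on the path, and a toy model standing in for the real network:

```python
import torch
from torch import nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

# Toy network; SynchronizedBatchNorm2d replaces nn.BatchNorm2d so that
# batch statistics are reduced across all replicas during training.
net = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, padding=1),
    SynchronizedBatchNorm2d(10),
    nn.ReLU(),
).cuda()

# DataParallelWithCallback invokes the replication callback on each copy,
# wiring up the cross-GPU synchronization that SyncBN relies on.
net = DataParallelWithCallback(net, device_ids=[0, 1])

out = net(torch.rand(16, 3, 32, 32).cuda())
out.sum().backward()  # on PyTorch 0.3.x this backward() reportedly hits the assertion above
```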

Environment:

- Ubuntu 14.04
- CUDA 8.0, cuDNN 5.1
- Python 3.6
- PyTorch 0.3.1

Do you have any suggestions? Thanks.

vacancy commented 5 years ago

This is a bug in PyTorch: https://github.com/pytorch/pytorch/issues/3883. Please upgrade your PyTorch version.
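A quick sanity check after upgrading, assuming (as the AttributeErrors above suggest) that the tests need the 0.4-era APIs `torch.allclose` and `torch.empty`:

```python
import torch

# torch.allclose and torch.empty both first appeared in PyTorch 0.4.0,
# so any version older than that will reproduce the errors above.
print(torch.__version__)
```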

vacancy commented 5 years ago

Closing for now. Feel free to reopen.