pytorch / contrib

Implementations of ideas from recent papers

[Bug] Unit test fails on multi-GPU setup #24

Open Balandat opened 5 years ago

Balandat commented 5 years ago

Running test_swa.py on a machine with multiple GPUs results in the following error:

Test output:
> test_swa (test.test_swa.TestSWA) ... ERROR
>
> ======================================================================
> ERROR: test_swa (test.test_swa.TestSWA)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/test/test_swa.py", line 313, in test_swa
>     lambda weight, bias: constructor([weight, bias]))
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/test/test_swa.py", line 238, in _test_basic_cases
>     constructor
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/test/test_swa.py", line 131, in _test_basic_cases_template
>     optimizer.step(fn)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/torchcontrib/optim/swa.py", line 206, in step
>     loss = self.optimizer.step(closure)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/torch/optim/lbfgs.py", line 427, in step
>     self._add_grad(t, d)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/opt/gen/pytorch/contrib/test_torchcontrib#binary,link-tree/torch/optim/lbfgs.py", line 264, in _add_grad
>     p.data.add_(step_size, update[offset:offset + numel].view_as(p.data))
> RuntimeError: expected device cuda:1 and dtype Double but got device cuda:0 and dtype Double
Balandat commented 5 years ago

cc @izmailovpavel

izmailovpavel commented 5 years ago

Hi @Balandat, looking into this right now. I believe the issue is that the L-BFGS optimizer requires all of the parameters to be on the same GPU (see the second warning here: https://pytorch.org/docs/stable/optim.html#torch.optim.LBFGS). I have tried replacing the SWA wrapper with the following simple wrapper, which does nothing but mimic the SWA interface, and it fails on the same test:

class LBFGSWrapper:
    def __init__(self, lbfgs):
        self.optimizer = lbfgs

    def step(self, *args):
        return self.optimizer.step(*args)

    def zero_grad(self, *args):
        return self.optimizer.zero_grad(*args)

    def swap_swa_sgd(self):
        pass

    def update_swa(self):
        pass

    def state_dict(self):
        return self.optimizer.state_dict()

    def load_state_dict(self, *args):
        return self.optimizer.load_state_dict(*args)

To fix this for now, we can replace the lines https://github.com/pytorch/contrib/blob/master/test/test_swa.py#L312-L313 with:

ignore_multidevice = constructor == lbfgs_constructor
self._test_basic_cases(
    lambda weight, bias: constructor([weight, bias]),
    ignore_multidevice=ignore_multidevice)
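
Here ignore_multidevice would be threaded through to the helper so that the cross-device sub-case is skipped for L-BFGS. A rough sketch of that gating, assuming the helper follows the same pattern as torch's own optimizer tests (the actual helper in test_swa.py may differ):

def _test_basic_cases(self, constructor, ignore_multidevice=False):
    # CPU and single-GPU sub-cases run unchanged.
    self._test_basic_cases_template(
        torch.randn(10, 5), torch.randn(10), constructor)
    # Skip the cross-device sub-case for optimizers (such as L-BFGS)
    # that require all parameters to live on the same device.
    if torch.cuda.device_count() > 1 and not ignore_multidevice:
        self._test_basic_cases_template(
            torch.randn(10, 5).cuda(0),
            torch.randn(10).cuda(1),
            constructor)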