lightly-ai / lightly

A Python library for self-supervised learning on images.
https://docs.lightly.ai/self-supervised-learning/
MIT License

Bug in `GatherLayer.backward` #1528

Closed: ordabayevy closed this issue 2 months ago

ordabayevy commented 2 months ago

Hi,

We've been implementing a model at cellarium-ml using your NTXentLoss. Comparing training runs on a single GPU and on two GPUs, we noticed that they do not match. Investigating, we found an apparent bug in `GatherLayer.backward`: gradients are not sum-reduced across GPUs. Here is our fixed version (https://github.com/cellarium-ai/cellarium-ml/blob/main/cellarium/ml/distributed/gather.py#L17-L21):

    @staticmethod
    def backward(ctx, *grads) -> torch.Tensor:
        # Take the gradient slice that corresponds to this rank's input.
        grad_out = grads[dist.get_rank()].contiguous()
        # Sum-reduce across ranks so the gradient accumulates the
        # contribution of every rank's loss, not just the local one.
        dist.all_reduce(grad_out, op=dist.ReduceOp.SUM)
        return grad_out

along with the test we wrote. Would you agree that this is indeed a bug? I would be happy to contribute a PR with the fix.
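
Roughly, such a test can look like this (a self-contained sketch, not the exact test from our repo; it assumes the gloo backend on CPU, two processes, and hypothetical names like run_test, with a GatherLayer that mirrors the fixed backward quoted above):

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    class GatherLayer(torch.autograd.Function):
        """All-gather whose backward sum-reduces gradients across ranks."""

        @staticmethod
        def forward(ctx, input: torch.Tensor) -> tuple:
            # Gather a copy of every rank's input onto every rank.
            output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
            dist.all_gather(output, input)
            return tuple(output)

        @staticmethod
        def backward(ctx, *grads) -> torch.Tensor:
            grad_out = grads[dist.get_rank()].contiguous()
            dist.all_reduce(grad_out, op=dist.ReduceOp.SUM)
            return grad_out


    def run_test(rank: int, world_size: int) -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29501"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        # Each rank owns one weight w_r.
        w = torch.ones(1, requires_grad=True)
        # After gathering, every rank sees (w_0, ..., w_{n-1}).
        gathered = GatherLayer.apply(w)
        # loss_r = w_0 + ... + w_{n-1}, computed independently on each rank.
        loss = torch.cat(gathered).sum()
        loss.backward()
        # w_r contributes to every rank's loss, so its gradient must be
        # world_size; without the all_reduce it would come out as 1.
        assert w.grad is not None
        assert w.grad.item() == world_size
        dist.destroy_process_group()


    if __name__ == "__main__":
        mp.spawn(run_test, args=(2,), nprocs=2)

Each rank owns one scalar weight; after gathering, every rank's loss depends on both weights, so the sum-reduced gradient on each weight should equal the world size.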

guarin commented 2 months ago

Hi, thanks a lot for reporting the issue!

There does indeed seem to be something wrong with our GatherLayer implementation. I was able to reproduce the problem and will look into it further in the coming days. A PR with the fix would be very welcome!

Btw. there might be something off with your test. The comments in the code indicate:

    # loss_0 = w_0 + w_1
    # loss_1 = w_0 + w_1

    # dloss/dw_0 = dloss_0/dw_0 + dloss_1/dw_0 = 2
    # dloss/dw_1 = dloss_0/dw_1 + dloss_1/dw_1 = 2

But AFAIK loss = w_0 + w_1 and not loss = loss_0 + loss_1, which means that dloss/dw_0 = dw_0/dw_0 + dw_1/dw_0 = 1 and not 2.
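
For illustration, a minimal single-process sketch of this reading (not code from the test): if the loss is computed once as w_0 + w_1, each weight receives a gradient of 1:

    import torch

    w_0 = torch.tensor(1.0, requires_grad=True)
    w_1 = torch.tensor(1.0, requires_grad=True)
    loss = w_0 + w_1  # a single loss over both weights
    loss.backward()
    print(w_0.grad, w_1.grad)  # tensor(1.) tensor(1.)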

ordabayevy commented 2 months ago

Hi, thanks for responding!

> But AFAIK loss = w_0 + w_1 and not loss = loss_0 + loss_1, which means that dloss/dw_0 = dw_0/dw_0 + dw_1/dw_0 = 1 and not 2.

The point I was trying to make in this test is that w_0 contributes to both loss_0 and loss_1, so we need to accumulate gradients from both losses. Similarly, w_1 contributes to both loss_0 and loss_1. And since w_0 and w_1 are each the counterpart of the single weight w in the one-GPU case, their gradients should all be the same (w_0.grad == w_1.grad == w.grad == 2).
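
Here is a minimal single-process analogue of what the two-GPU setup does (a sketch for clarity, not the actual test): each rank computes its own loss over both gathered weights and calls backward, and the gradients accumulate:

    import torch

    w_0 = torch.tensor(1.0, requires_grad=True)
    w_1 = torch.tensor(1.0, requires_grad=True)
    loss_0 = w_0 + w_1  # rank 0's loss over the gathered weights
    loss_1 = w_0 + w_1  # rank 1's loss over the gathered weights
    loss_0.backward()
    loss_1.backward()  # .grad accumulates across the two backward passes
    print(w_0.grad, w_1.grad)  # tensor(2.) tensor(2.)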

MalteEbner commented 2 months ago

I created a PR with a test and a fix for it: https://github.com/lightly-ai/lightly/pull/1531