Hi, thanks a lot for reporting the issue!
There indeed seems to be something wrong with our `GatherLayer` implementation. I was able to reproduce the problem and will look into it further in the coming days. A PR with the fix would be very welcome!
By the way, there might be something off with your test. The comments in the code indicate:
```python
# loss_0 = w_0 + w_1
# loss_1 = w_0 + w_1
# dloss/dw_0 = dloss_0/dw_0 + dloss_1/dw_0 = 2
# dloss/dw_1 = dloss_0/dw_1 + dloss_1/dw_1 = 2
```
But AFAIK `loss = w_0 + w_1` and not `loss = loss_0 + loss_1`, which means that `dloss/dw_0 = dw_0/dw_0 + dw_1/dw_0 = 1`, and not 2.
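This is easy to check on a single device (a minimal sketch, not part of the original comment):

```python
import torch

w = torch.ones(2, requires_grad=True)

# A single loss that sums both entries of w.
loss = w[0] + w[1]
loss.backward()

print(w.grad)  # tensor([1., 1.]) -- each entry contributes once
```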
Hi, thanks for responding!
> But AFAIK `loss = w_0 + w_1` and not `loss = loss_0 + loss_1`, which means that `dloss/dw_0 = dw_0/dw_0 + dw_1/dw_0 = 1`, and not 2.
The point I was trying to make in this test is that `w_0` contributes to both `loss_0` and `loss_1`. Therefore, we need to accumulate gradients from both losses. Similarly, `w_1` contributes to both `loss_0` and `loss_1`. And since `w_0` and `w_1` are equivalent to a single `w` in the single-GPU case, their gradients should be the same (`w_0.grad == w_1.grad == w.grad == 2`).
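The single-GPU equivalent of the two-rank setup accumulates both losses into the same parameter (a minimal sketch, assuming two ranks each compute the same sum-of-weights loss):

```python
import torch

w = torch.ones(2, requires_grad=True)

# Rank 0's loss and rank 1's loss both sum the gathered weights.
loss_0 = w[0] + w[1]
loss_1 = w[0] + w[1]

# backward() accumulates gradients from both losses into w.grad.
loss_0.backward()
loss_1.backward()

print(w.grad)  # tensor([2., 2.]) -- matches w_0.grad == w_1.grad == 2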
I created a PR with a test and a fix for it: https://github.com/lightly-ai/lightly/pull/1531
Hi,
We've been implementing a model at cellarium-ml using your `NTXentLoss`. Comparing model training with a single GPU and with two GPUs, we noticed that the results do not match. Investigating this, we found an apparent bug in `GatherLayer.backward`, where gradients are not sum-reduced over GPUs. Here is our fixed version (https://github.com/cellarium-ai/cellarium-ml/blob/main/cellarium/ml/distributed/gather.py#L17-L21), along with the test we wrote. Would you agree that this is indeed a bug? I would be happy to contribute a PR with the fix.
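For context, here is a minimal sketch of a `GatherLayer` whose `backward` sum-reduces gradients across ranks (this paraphrases the idea behind the linked fix rather than copying it verbatim):

```python
import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """Gather tensors from all processes while keeping them differentiable."""

    @staticmethod
    def forward(ctx, input: torch.Tensor) -> tuple:
        gathered = [torch.empty_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, input)
        return tuple(gathered)

    @staticmethod
    def backward(ctx, *grads: torch.Tensor) -> torch.Tensor:
        # Each rank only sees the gradient of its own loss. Sum-reduce over
        # all ranks so that every rank's weights receive contributions from
        # every loss, then return the slice that belongs to this rank.
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)  # defaults to SUM
        return all_gradients[dist.get_rank()]
```

Without the `all_reduce`, each rank returns only its local gradient slice, which drops the contributions from the other ranks' losses and causes the single-GPU/multi-GPU mismatch described above.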