The problem is that doing the gather is not compatible with gradient propagation: it clones the tensors, so basically gradients don't flow backward. You can compute the loss on each process without the gather; the gradients will be averaged at the end of the backward pass.
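For intuition, `Accelerator.gather` is built on `torch.distributed.all_gather`, which copies into freshly allocated output tensors. A minimal sketch of what that means for autograd (run under a distributed launcher such as `torchrun`, with the process group already initialized):

```python
import torch
import torch.distributed as dist

x = torch.randn(4, 8, requires_grad=True)
outputs = [torch.empty_like(x) for _ in range(dist.get_world_size())]
dist.all_gather(outputs, x)  # plain copies: the outputs carry no grad_fn

gathered = torch.cat(outputs)
print(gathered.requires_grad)  # False: a loss computed on `gathered`
                               # cannot reach `x` during backward()
```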
@sgugger Thanks for your reply.
In contrastive learning, InfoNCE loss is often used as a loss function:
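For reference, the standard InfoNCE form, writing $q$ for an anchor, $k^{+}$ for its positive key, $k_i$ for the $N$ candidate keys, $\mathrm{sim}(\cdot,\cdot)$ for a similarity score (e.g. cosine), and $\tau$ for a temperature:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(q, k^{+})/\tau\big)}{\sum_{i=1}^{N} \exp\!\big(\mathrm{sim}(q, k_i)/\tau\big)}$$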
So "the gradients will be averaged" is not equivalent to increasing the batch size in contrastive learning: averaging the per-GPU losses leaves each sample with only its local in-batch negatives, while a genuinely larger batch adds more negatives to the denominator of the loss.
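Concretely (a sketch in the notation above), with $N$ GPUs and a per-GPU batch $\mathcal{B}_p$ of size $B$ over embeddings $z_i$, averaging the per-process losses gives

$$\frac{1}{N}\sum_{p=1}^{N}\mathcal{L}_p, \qquad \mathcal{L}_p = -\frac{1}{B}\sum_{i\in\mathcal{B}_p}\log\frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\sum_{j\in\mathcal{B}_p,\, j\neq i}\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)},$$

where each denominator ranges only over the $B$ local samples, while the large-batch objective sums over all $NB$ samples:

$$\mathcal{L} = -\frac{1}{NB}\sum_{i=1}^{NB}\log\frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\sum_{j\neq i}\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}.$$

The denominators differ ($B-1$ negatives versus $NB-1$), so the two objectives are not equal, and neither are their gradients.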
I would like to know how to implement a gather function that preserves the gradient.
Quick Google sleuthing gave me this for you to try @JWargrave, though I do think this may be better as a discussion post on the forums: https://discuss.huggingface.co/c/accelerate/18
https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8
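For reference, the trick described in that post (and used by several open-source contrastive implementations) is roughly the following; `gather_with_grad` is an illustrative name, not an Accelerate API:

```python
import torch
import torch.distributed as dist

def gather_with_grad(tensor):
    """All-gather that keeps the local shard attached to the autograd graph.

    dist.all_gather fills freshly allocated tensors, so its outputs carry no
    grad_fn. Re-inserting the local tensor at this rank's slot restores the
    gradient path for the local shard; DDP's gradient averaging across ranks
    then recovers the full-batch gradient, since every rank computes the same
    loss over the gathered batch.
    """
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    gathered[dist.get_rank()] = tensor  # keep the autograd connection
    return torch.cat(gathered, dim=0)
```

Newer PyTorch versions also ship a differentiable `torch.distributed.nn.functional.all_gather`, which handles the backward pass for you.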
@muellerzr Thanks a lot. I have also posted the discussion at https://discuss.huggingface.co/t/question-bug-about-accelerator-gather-how-to-use-accelerate-accelerator-gather-for-contrastive-learning/33177?u=jwargrave.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi there,

I am new to `accelerate` and I've found that it really improves my development productivity. Thanks for your great work. But I have some problems when using `accelerator.gather`.

I trained a simple `resnet18` classifier on the CIFAR10 dataset. The training loop ("loss plan 1") is shown below, and it works well: the training accuracy reaches about 70% after 10 epochs.
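Roughly, assuming `model`, `optimizer`, and `train_dataloader` have already been passed through `accelerator.prepare` (names are illustrative):

```python
import torch.nn.functional as F

# loss plan 1: compute the loss locally on each process;
# DDP averages the gradients across processes during backward()
for epoch in range(10):
    for images, labels in train_dataloader:
        optimizer.zero_grad()
        logits = model(images)
        loss = F.cross_entropy(logits, labels)
        accelerator.backward(loss)
        optimizer.step()
```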
But there is a problem when I train as follows:
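The only difference from plan 1 is that the logits and labels are gathered across processes before the loss is computed (again a sketch):

```python
import torch.nn.functional as F

# loss plan 2: gather logits and labels from all processes,
# then compute the loss on the gathered tensors
for epoch in range(10):
    for images, labels in train_dataloader:
        optimizer.zero_grad()
        logits = model(images)
        all_logits = accelerator.gather(logits)
        all_labels = accelerator.gather(labels)
        loss = F.cross_entropy(all_logits, all_labels)
        accelerator.backward(loss)
        optimizer.step()
```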
The training loss is almost unchanged, and the training accuracy stays at about 10%, which is equivalent to random guessing.
The code above may look odd, but I don't see why it should be wrong, and yet it is.
(The reason I'm doing this is that I want to use `accelerate` when training on contrastive learning tasks. In contrastive learning, the larger the batch size the better, and each sample in the batch uses all the other samples in the batch as negative examples to compute the loss. For example, when I train with four GPUs and a per-GPU batch size of 64, I want each sample to be compared with 64*4-1 negative samples instead of 64-1. In this case I need to use `accelerator.gather`.)

The full code is as follows (it works well for loss plan 1 but not for loss plan 2):
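Putting it together (a runnable sketch; hyperparameters such as the batch size, learning rate, and transforms are illustrative guesses):

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
from accelerate import Accelerator

LOSS_PLAN = 1  # 1: local loss (works), 2: loss on gathered tensors (fails)

def main():
    accelerator = Accelerator()

    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor()
    )
    train_dataloader = DataLoader(train_set, batch_size=64, shuffle=True)

    model = torchvision.models.resnet18(num_classes=10)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    for epoch in range(10):
        correct, total = 0, 0
        for images, labels in train_dataloader:
            optimizer.zero_grad()
            logits = model(images)

            if LOSS_PLAN == 1:
                # loss plan 1: local loss; DDP averages gradients in backward()
                loss = F.cross_entropy(logits, labels)
            else:
                # loss plan 2: the gathered logits are no longer connected to
                # the model's autograd graph, so no useful gradient flows back
                all_logits = accelerator.gather(logits)
                all_labels = accelerator.gather(labels)
                loss = F.cross_entropy(all_logits, all_labels)

            accelerator.backward(loss)
            optimizer.step()

            # accuracy over the global batch (gathering is fine here,
            # since metrics don't need gradients)
            preds = accelerator.gather(logits.argmax(dim=-1))
            refs = accelerator.gather(labels)
            correct += (preds == refs).sum().item()
            total += refs.numel()

        accelerator.print(f"epoch {epoch}: train accuracy {correct / total:.3f}")

if __name__ == "__main__":
    main()
```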
I'm wondering where I'm going wrong with my code, or how I should use `accelerator.gather` correctly.
Thanks a lot.