In your code, you gather all the embeddings from the GPUs in DDP, but the loss you compute is not divided by the number of GPUs. I think this post explains why that division is needed. So does it not matter whether the loss is divided by the number of GPUs? I hope you can explain, or perhaps I am missing something.
We do not need to explicitly divide the gradient by the number of GPUs here. The loss is divided by the global batch size at the end, which already takes the number of GPUs into account.
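A minimal sketch of why this works, using plain Python to simulate per-GPU losses (the values and the 4-GPU setup are illustrative assumptions, not taken from the repository):

```python
# Hypothetical per-sample losses on 4 simulated "GPUs".
num_gpus = 4
per_gpu_batch = 8
losses = [[float(g * per_gpu_batch + i) for i in range(per_gpu_batch)]
          for g in range(num_gpus)]

# After gathering embeddings, one process computes the loss over every
# sample. Dividing the summed loss by the GLOBAL batch size
# (num_gpus * per_gpu_batch) ...
global_batch = num_gpus * per_gpu_batch
gathered_loss = sum(sum(l) for l in losses) / global_batch

# ... is the same as averaging each GPU's local mean loss and then
# dividing by num_gpus: that factor is already inside global_batch.
per_gpu_means = [sum(l) / per_gpu_batch for l in losses]
equivalent = sum(per_gpu_means) / num_gpus

print(gathered_loss, equivalent)  # the two values are identical
```

So an extra division by the number of GPUs would double-count the correction: the global-batch-size denominator already contains it.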