kkarrancsu opened this issue 2 years ago
@kkarrancsu Thanks for your answer.
In general, idist.all_gather()
can be used as long as the call is made collectively by all the processes. Therefore, you can use this method to gather the predictions in your training loop.
I can provide an example asap and maybe update the doc accordingly.
However, I'm not completely sure about your question. If you want to compute predictions in DDP, gather them, and backpropagate from one process, it won't work. You can check the internal design: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
@sdesrozis Thanks for your quick reply! Sorry if my initial question was unclear. As an example:
```python
m = model()
m_dp = nn.DataParallel(m)
m_ddp = nn.DistributedDataParallel(m)

x = input              # [batch_size, ...]
y_dp = m_dp(x)         # [batch_size, ...]
y_ddp = m_ddp(x)       # [batch_size/ngpu, ...]
```
I'd like to gather all the y_ddp outputs from all GPUs before computing a loss. I hope that makes the question clearer?
Thanks for the clarification. Would you like to use the loss as a metric? Or would you want to call loss.backward()?
I'd like to call loss.backward()
Ok, so I think it won't work even if you gather the predictions. The gathering operation is not an autodiff function, so it will cut the computation graph. The internal states created by the forward pass won't be gathered either.
Although I'm pretty sure this has been answered on the PyTorch forum. Maybe I'm wrong, though, and I would be interested in a discussion about this topic.
EDIT see here https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8
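The linked article's idea can be sketched as a custom autograd.Function that performs the all_gather in the forward pass and routes gradients back to each process's local chunk in the backward pass. This is an illustrative sketch, not a PyTorch API; the name GatherWithGrad is made up, and the demo runs with a single-process gloo group just to exercise the mechanics.

```python
# Sketch: an all_gather that participates in autograd. In backward, every
# rank holds a gradient w.r.t. the full gathered tensor; summing the
# contributions across ranks and slicing out the local chunk gives the
# gradient of the local input.
import os

import torch
import torch.distributed as dist


class GatherWithGrad(torch.autograd.Function):
    """all_gather whose backward returns the gradient of the local chunk."""

    @staticmethod
    def forward(ctx, x):
        gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, x)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Sum the gradient contributions from all ranks, then keep the
        # slice corresponding to this rank's input.
        grad_output = grad_output.clone()
        dist.all_reduce(grad_output)
        rank, world = dist.get_rank(), dist.get_world_size()
        n = grad_output.shape[0] // world
        return grad_output[rank * n:(rank + 1) * n]


# Single-process demo (world_size=1) to show the gradient flows through.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = GatherWithGrad.apply(x)
(y * y).sum().backward()   # d/dx of sum(x^2) = 2x
print(x.grad)              # tensor([2., 4.])
dist.destroy_process_group()
```

With world_size > 1, each rank would compute the loss on the full gathered tensor and still receive correct gradients for its local slice, which is exactly the point being debated above.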
@sdesrozis Thanks - I will investigate based on your link and report back.
Good! Although I'm doubtful about the link… I'd be interested in your feedback.
@kkarrancsu can you provide a bit more detail on what exactly you would like to do?
In DDP, data is distributed to N processes and the model is cloned. During the forward pass, each process obtains predictions y_preds = m_ddp(x)
on its data chunk; using a loss function and loss.backward()
we can compute gradients, which are finally summed up and applied to the model internally by the PyTorch DDP wrapper.
As for distributed autograd, you can also check: https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework
Hi @vfdev-5, sure.
We are using the Supervised Contrastive loss to train an embedding. In Eq. 2 of the paper, we see that the loss depends on the number of samples used to compute it (positive and negative).
My colleague suggested that it is better to compute the loss over all examples (the entire batch), rather than over batch/ngpu
samples (which is what happens when using DDP and computing the loss locally on each GPU). This is because the denominator in SupConLoss sums over the negative samples, so aggregating all of the negative samples across GPUs first gives a more accurate loss.
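To make the dependence on batch size concrete, here is a minimal single-process sketch of the SupCon loss (Eq. 2 of the paper), assuming L2-normalized features. The function name and shapes are illustrative, not from the issue. The denominator involves every other sample the anchor sees, so gathering embeddings across GPUs before computing the loss enlarges the negative set.

```python
import torch
import torch.nn.functional as F


def supcon_loss(features, labels, temperature=0.1):
    """features: [N, D] L2-normalized embeddings; labels: [N] class ids."""
    n = features.shape[0]
    sim = features @ features.t() / temperature            # [N, N] similarities
    self_mask = torch.eye(n, dtype=torch.bool)
    # Denominator of Eq. 2: every sample except the anchor itself.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Numerator: positives are the other samples sharing the anchor's label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss.mean()


torch.manual_seed(0)
feats = F.normalize(torch.randn(8, 4), dim=1)   # stand-in for gathered y_ddp
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = supcon_loss(feats, labels)
```

With per-GPU batches, each process would see only batch/ngpu rows in `sim`, i.e. a smaller denominator than the full-batch version above.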
Ok, I understand. You should have a look at a distributed implementation of SimCLR. See for instance
This might give you some inspiration.
This code is not so correct. Please check this issue: https://github.com/Spijkervet/SimCLR/issues/30 and my pr: https://github.com/Spijkervet/SimCLR/pull/46.
I have defined my train_step in the exact same way as in the cifar10 example. Is it possible to gather all of the predictions before computing the loss? I haven't seen examples of this pattern in the ignite examples (maybe I'm missing it?), but for my application it is better to compute the loss after aggregating the forward passes and targets run on multiple GPUs. This only matters when using DistributedDataParallel, since DataParallel automatically aggregates the outputs. I see the idist.all_gather()
function, but am unclear how to use it in a training loop.