Closed — subash-khanal closed this issue 11 months ago
Hello, it looks like you have already resolved the issue. To be clear, the instructions below are executed by each GPU device:
```python
loss_img, loss_dict_img = self.criterion(img_emb, {'mean': cap_emb_all}, distributed=True)
loss_cap, loss_dict_cap = self.criterion(cap_emb, {'mean': img_emb_all}, distributed=True)
```
These instructions correspond to the purple cells (if we treat each cell as the mini-batch samples of one GPU) in the figure below (from my paper):
At the end of the loss computation, all losses are summed to obtain the final loss, so this is equivalent to computing the loss for every cell (every mini-batch) of the full similarity matrix.
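The argument above can be checked numerically. The following is a minimal NumPy sketch (not the repository's actual code): each simulated "rank" computes InfoNCE between its local image embeddings and the caption embeddings gathered from all ranks, and the sum of these per-rank losses matches the loss computed over the full batch at once. The helper `info_nce` and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_gpus, local_bs, dim = 4, 8, 16

# Per-GPU mini-batches of image / caption embeddings (L2-normalized).
img = [rng.normal(size=(local_bs, dim)) for _ in range(num_gpus)]
cap = [rng.normal(size=(local_bs, dim)) for _ in range(num_gpus)]
img = [x / np.linalg.norm(x, axis=1, keepdims=True) for x in img]
cap = [x / np.linalg.norm(x, axis=1, keepdims=True) for x in cap]

def info_nce(queries, keys, pos_offset):
    """Row-wise InfoNCE: queries[i] is positive with keys[pos_offset + i]."""
    logits = queries @ keys.T
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(len(queries))
    return -log_prob[rows, pos_offset + rows].sum()

# What all_gather would produce on every device.
cap_all = np.concatenate(cap)

# Each rank: LOCAL image embeddings vs. GATHERED caption embeddings,
# with the positive pairs offset by that rank's position in the batch.
per_rank_sum = sum(
    info_nce(img[r], cap_all, r * local_bs) for r in range(num_gpus)
)

# Single-process reference: the full similarity matrix at once.
full = info_nce(np.concatenate(img), cap_all, 0)

assert np.allclose(per_rank_sum, full)
```

Because InfoNCE decomposes as a sum over rows of the similarity matrix, each rank owning its own rows (local queries) against all columns (gathered keys) covers the whole matrix exactly once.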
Note that this holds only for InfoNCE (i.e., a batch-wise contrastive loss); methods based on a pair-wise contrastive loss, such as my PCME or PCME++ implementations, need a different solution.
Closing the issue as it looks resolved. Please re-open it if necessary.
Thanks for releasing the code for your amazing work!
I was trying to play with PCME/PCME++ a little bit, and I have some confusion regarding the loss computation in distributed training. Specifically, in this line, shouldn't the loss be computed with embeddings gathered from all devices? Currently, it looks like the embeddings of one modality (image) from only the current rank's device are used to compute the loss against the gathered embeddings from all devices for the other modality (caption). Could you let me know if I am misunderstanding something here?