microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

About multi-gpu loss calculation #13

Closed: forence closed this issue 3 years ago

forence commented 3 years ago

Thanks for your nice work! I noticed there is a mean() when the program runs on multiple GPUs, but there is no gather operation. In other words, the loss at
https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L332 is a scalar, not a list of tensors. Am I right?

ArrowLuo commented 3 years ago

Hi @forence, as far as I know, the gather operation is performed by PyTorch's torch.nn.parallel.DistributedDataParallel. You can print the loss to confirm it.

forence commented 3 years ago

Hi Arrow, I printed the loss on two GPUs; the results are as follows:
device: 1 loss: 0.20037230849266052
device: 0 loss: 0.19869431853294373
device: 0 loss: 0.2036360800266266
device: 1 loss: 0.2001209855079651
device: 0 loss: 0.20593053102493286
device: 1 loss: 0.20257243514060974
device: 1 loss: 0.19430749118328094
device: 0 loss: 0.19785669445991516
device: 1 loss: 0.19507986307144165
In my view, the mean() has no effect here, since there is no gather function that explicitly collects the losses from multiple GPUs; instead, the gradients on different GPUs are averaged automatically by PyTorch's DDP, as you mentioned. Am I missing something?

ArrowLuo commented 3 years ago

Hi @forence, you are right. I confused torch.nn.DataParallel with torch.nn.parallel.DistributedDataParallel. Thank you for pointing it out. The mean() is indeed redundant in our code. Thanks.
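
For anyone hitting the same confusion, here is a minimal, hypothetical sketch (a toy model, not the UniVL code) showing why the loss gains an extra dimension under torch.nn.DataParallel but is already a per-process scalar under DistributedDataParallel, where synchronization happens on the gradients via all-reduce rather than on the loss:

```python
import torch
import torch.nn as nn

# Toy model (not UniVL): returns a scalar loss per forward pass.
class ToyModel(nn.Module):
    def forward(self, x):
        return (x ** 2).mean()

model = ToyModel()
x = torch.randn(8, 4)

# Single process / one DDP rank: the loss is already a 0-dim scalar, so
# loss.mean() is a no-op; DDP synchronizes gradients via all-reduce during
# backward(), no explicit loss gather is needed.
loss = model(x)
print(loss.dim())  # prints 0

# nn.DataParallel (only if more than one GPU is visible): each replica returns
# its own scalar and DataParallel gathers them into a 1-D tensor of length
# num_gpus, which is exactly what the .mean() was meant to reduce.
if torch.cuda.device_count() > 1:
    dp_model = nn.DataParallel(model.cuda())
    dp_loss = dp_model(x.cuda())
    print(dp_loss.dim())  # prints 1
```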

forence commented 3 years ago

By the way, I am pre-training the model at stage one using maxMarginRankingLoss, and the loss is extremely low, about 0.002, at the beginning (batch size is 2048, gradient_accumulation_steps is 16). Is this normal? How do I judge when training is ready to move on to stage two?
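
For context, a hinge-style max-margin ranking loss can legitimately sit near zero once most negatives are pushed outside the margin. Below is an illustrative sketch over a (batch, batch) similarity matrix with positives on the diagonal; the function name and the 0.1 margin are assumptions, not the exact UniVL MaxMarginRankingLoss.

```python
import torch

# Illustrative max-margin ranking loss (assumed margin = 0.1, positives on the
# diagonal of a (B, B) similarity matrix); not the exact UniVL implementation.
def max_margin_ranking_loss(sim: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    diag = sim.diag().unsqueeze(1)                              # positive scores, shape (B, 1)
    cost_rows = torch.clamp(margin + sim - diag, min=0.0)       # rank positives above row negatives
    cost_cols = torch.clamp(margin + sim - diag.t(), min=0.0)   # and above column negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_rows = cost_rows.masked_fill(mask, 0.0)                # positives don't compete with themselves
    cost_cols = cost_cols.masked_fill(mask, 0.0)
    # Once the margin is satisfied for most pairs, both terms are exactly 0,
    # which is why the reported loss can drop close to 0 early in training.
    return cost_rows.mean() + cost_cols.mean()

print(max_margin_ranking_loss(torch.randn(8, 8)))
```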

ArrowLuo commented 3 years ago

Does "at the beginning" mean before the first epoch finished? Our loss is not that small at the beginning. The important thing is whether the loss converges. Besides, what is your pretraining dataset, and why do you use maxMarginRankingLoss? I think an NCE loss will work better for pretraining.

forence commented 3 years ago

Yes, 0.002 is the loss at the end of the 1st epoch. However, I do see a decline in the loss. I use maxMarginRankingLoss because our dataset has only one positive per sample.

ArrowLuo commented 3 years ago

You can still use an NCE or CE loss when there is only one positive per sample. If you use maxMarginRankingLoss to pretrain in your setting, you need to set a larger learning rate, if I remember correctly. In my experience, the loss decreases quickly during the first epoch (see the log printed here).
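
To make the suggestion concrete, here is a hedged sketch of a symmetric NCE/CE-style retrieval loss over a (batch, batch) text-video similarity matrix; it needs only one positive per sample (the diagonal) because the other in-batch samples serve as negatives. The function name and shapes are illustrative, not taken from the UniVL code.

```python
import torch
import torch.nn.functional as F

# Illustrative symmetric NCE/CE-style retrieval loss: the single positive for
# each sample sits on the diagonal, everything else in the batch is a negative.
def nce_retrieval_loss(sim_matrix: torch.Tensor) -> torch.Tensor:
    # sim_matrix: (batch, batch) text-video similarities.
    targets = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_t2v = F.cross_entropy(sim_matrix, targets)      # text -> video direction
    loss_v2t = F.cross_entropy(sim_matrix.t(), targets)  # video -> text direction
    return 0.5 * (loss_t2v + loss_v2t)

print(nce_retrieval_loss(torch.randn(16, 16)))
```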

forence commented 3 years ago

Oh right, I will try that later! Could you provide the general loss range of the two stages for reference?

ArrowLuo commented 3 years ago

For your reference, roughly 0.13 -> 0.02 for stage one and 0.12 -> 0.09 for stage two. These numbers are not exact due to incomplete logs caused by machine problems. Once again, convergence is more important.

forence commented 3 years ago

Thanks for your kind response! All the best :)