open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0
5.29k stars 1.17k forks

[Feature] how to use torch.nn.DataParallel in mmpose single machine multi-card training #2381

Closed wintercat1994 closed 1 year ago

wintercat1994 commented 1 year ago

What is the feature?

Could you please tell me how to use torch.nn.DataParallel in mmpose for single-machine multi-card training to get a bigger batch size? I have observed that single-machine multi-card training currently only supports torch.nn.DistributedDataParallel, which means the batch size is limited by the video memory of a single card.

Any other context?

No response

Ben-Louis commented 1 year ago

When you are using torch.nn.DistributedDataParallel, the actual batch size is batch_size_per_gpu * num_gpus.
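For reference, a minimal sketch of what this looks like in practice, assuming the mmpose 1.x config style (the config path and GPU count below are placeholders): the `batch_size` in `train_dataloader` is the per-GPU value, so the effective batch size scales with the number of GPUs.

```python
# Sketch of the relevant config fields (not a complete config).
# batch_size here is per GPU, so training on 4 GPUs gives an
# effective batch size of 64 * 4 = 256.
train_dataloader = dict(
    batch_size=64,      # samples per GPU
    num_workers=4,
    dataset=dict(...),  # dataset settings omitted
)

# Typically launched with the distributed training script, e.g.:
#   bash ./tools/dist_train.sh configs/your_config.py 4
```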

wintercat1994 commented 1 year ago

When you are using torch.nn.DistributedDataParallel, the actual batch size is batch_size_per_gpu * num_gpus.

Thank you very much for your answer! However, I added a head for contrastive learning during model training. When I printed the shape of the input to the contrastive learning head while computing the loss, I found that only the data on a single card is used to compute the contrastive loss. This can hurt the performance of contrastive learning, since the contrastive loss on each card is computed before the gradients from the different cards are combined. How can I use the data from all cards when computing the contrastive learning loss?

Ben-Louis commented 1 year ago

I think you could try the all_gather function to gather the features from all GPUs.
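A minimal sketch of that idea, assuming the contrastive features are a (batch, dim) tensor and using torch.distributed.all_gather (the helper name gather_features is hypothetical, not an mmpose API). Note that all_gather does not propagate gradients for tensors received from other ranks, so the local slice is put back in place to keep its gradient:

```python
import torch
import torch.distributed as dist


def gather_features(local_feats: torch.Tensor) -> torch.Tensor:
    """Gather contrastive features from all GPUs into one big batch."""
    if not (dist.is_available() and dist.is_initialized()):
        # Single-GPU / non-distributed run: nothing to gather.
        return local_feats

    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # all_gather fills a list with each rank's tensor (no gradients attached).
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats)

    # Re-insert this rank's original tensor so gradients still flow
    # through the local features when the contrastive loss is computed.
    gathered[rank] = local_feats
    return torch.cat(gathered, dim=0)
```

The gathered tensor can then be fed to the contrastive loss so that negatives come from the full effective batch rather than from a single card only.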

wintercat1994 commented 1 year ago

I think you could try the all_gather function to gather the features from all GPUs.

Thank you! I will try it!