salesforce / UniControl

Unified Controllable Visual Generation Model
https://canqin001.github.io/UniControl-Page/
Apache License 2.0

About multi-gpus training #27

Open YANDaoyu opened 6 months ago

YANDaoyu commented 6 months ago

This is definitely impressive work!

I'm trying to reproduce some results on the inpainting task and have a concern about the data-parallel mode. Referring to the code, the batch_size is 4 for a single GPU and the total number of inpainting pairs is about 2.8M, so the logged total is 700k steps. When I train on 8 GPUs, the total is still logged as 700k steps, and I've checked GPU memory usage: all GPUs are nearly fully used. So I'm wondering whether the effective training batch size on 8 GPUs is 4*8, or whether there is some misalignment in the logging.
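For reference, here is my quick back-of-the-envelope check of those numbers, assuming standard DDP where each GPU draws its own batch of 4, so the effective batch size scales with the GPU count:

```python
# Back-of-the-envelope check; assumes DDP-style data parallelism where each
# of the 8 GPUs processes its own batch of 4 samples per optimizer step.
num_pairs = 2_800_000       # approximate number of inpainting pairs
per_gpu_batch = 4           # batch_size per GPU from the config
num_gpus = 8

steps_single_gpu = num_pairs // per_gpu_batch      # 700,000 steps per epoch on 1 GPU
effective_batch = per_gpu_batch * num_gpus         # 32 samples per step with 8 GPUs
steps_eight_gpus = num_pairs // effective_batch    # 87,500 steps per epoch on 8 GPUs

print(steps_single_gpu, effective_batch, steps_eight_gpus)
```

If the effective batch size really is 32, I would have expected the logger to report roughly 87.5k steps per epoch rather than 700k.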

Thanks for your time.

gallenszl commented 6 months ago

same question. Looking forward to the answer. @canqin001 @shugerdou

canqin001 commented 6 months ago

Thank you for this question. For multi-GPU training, the overall batch size is num_per_batch * num_batch (the per-GPU batch size times the number of GPUs). The 700k iteration count is independent of the batch size, so you need to manually adjust the number of iterations to match the overall computation cost.
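If it helps, here is a minimal sketch of the kind of adjustment this implies, assuming the ControlNet-style pytorch_lightning training setup; the variable names and values are illustrative, not the repo's actual config keys:

```python
import pytorch_lightning as pl

# Illustrative values; the actual config in this repo may differ.
per_gpu_batch_size = 4
num_gpus = 8
target_samples = 2_800_000          # roughly one pass over the inpainting pairs

# With data parallelism the effective batch size grows with the GPU count,
# so the number of optimizer steps for the same amount of data shrinks.
effective_batch_size = per_gpu_batch_size * num_gpus
max_steps = target_samples // effective_batch_size   # ~87.5k instead of 700k

trainer = pl.Trainer(
    gpus=num_gpus,        # PL 1.x-style argument
    strategy="ddp",
    max_steps=max_steps,  # manually assign iterations to match the compute budget
)
# trainer.fit(model, dataloader)  # model / dataloader built as in the training script
```

The point is only that max_steps (or the equivalent iteration setting) has to be rescaled by hand when the effective batch size changes; the logged 700k is not adjusted automatically.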