Open YANDaoyu opened 9 months ago
This is definitely impressive work!

I'm trying to reproduce some results on the inpainting task and have a concern about the data_parallel mode. Looking at the code, batch_size is 4 for a single GPU and there are about 2.8M pairs of inpainting data in total, so the total number of steps logged is 700k. When I train on 8 GPUs, the total is still logged as 700k steps, yet checking GPU memory usage shows all GPUs nearly fully used. So I'm wondering: is the training batch size for 8 GPUs actually 4*8, or is there a misalignment in the logging?

Thanks for your time.

Same question, looking forward to the answer. @canqin001 @shugerdou

Thank you for this question. For multi-GPU training, the overall batch size is num_per_batch * num_batch, i.e., the per-GPU batch size times the number of GPUs. The 700k-iteration target is independent of the batch size, so you need to manually adjust the number of iterations to match the intended overall computation cost.
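To make the arithmetic above concrete, here is a minimal sketch of the batch-size and iteration bookkeeping being described. All variable names (dataset_size, per_gpu_batch_size, num_gpus, etc.) are illustrative and not taken from the repository.

```python
# Minimal sketch of the data-parallel batch/iteration arithmetic discussed
# above. Names are illustrative, not from the repo's config.

dataset_size = 2_800_000      # ~2.8M inpainting pairs, per the question
per_gpu_batch_size = 4        # batch_size for a single GPU
num_gpus = 8                  # data-parallel replicas

# Under data parallelism, each logged step consumes one batch per GPU,
# so the effective (overall) batch size is:
overall_batch_size = per_gpu_batch_size * num_gpus   # 4 * 8 = 32

# The logger counts optimizer steps, not samples, so it still reports
# the single-GPU figure: 2.8M / 4 = 700k steps per epoch.
single_gpu_steps = dataset_size // per_gpu_batch_size    # 700_000

# To keep the overall computation cost (samples seen) the same, the
# iteration target must be scaled down by the number of GPUs:
multi_gpu_steps = dataset_size // overall_batch_size     # 87_500

print(f"effective batch size: {overall_batch_size}")
print(f"steps per epoch on 1 GPU: {single_gpu_steps}")
print(f"steps per epoch on {num_gpus} GPUs: {multi_gpu_steps}")
```

In other words, the 700k figure in the logs is not wrong so much as unscaled: each of those steps now processes 32 samples instead of 4, so matching the single-GPU compute budget means training for roughly 700k / 8 = 87.5k steps.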