Closed EDENpraseHAZARD closed 3 years ago
Hi, since we didn't have the hardware to train on several GPUs, we simulate larger batch sizes by accumulating the gradients over multiple training iterations (note that this is not exactly the same as training with a larger batch size, but that's a separate issue). In the above code, the gradients will be accumulated `optimizer_step_interval` times before calling `optimizer.step()`. Also keep in mind that under DDP, this code will be executed in parallel in multiple processes, so even though the batch size given to the data loader is 1, the overall effective batch size is 1 × `self.num_gpus` × `optimizer_step_interval`.
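To illustrate the accumulation scheme described above, here is a minimal framework-free sketch. The toy loss, learning rate, and data are illustrative assumptions; only the name `optimizer_step_interval` comes from the issue. The pattern mirrors PyTorch, where `backward()` adds into `.grad` until the optimizer steps and the gradients are zeroed:

```python
optimizer_step_interval = 4   # accumulate over this many micro-batches
lr = 0.1
w = 2.0                       # single scalar "parameter" (toy model)

micro_batches = [1.0, 2.0, 3.0, 4.0]   # toy inputs, one sample each


def grad(w, x):
    # gradient of the toy loss L = 0.5 * (w * x)^2 with respect to w
    return (w * x) * x


accum = 0.0
for step, x in enumerate(micro_batches, start=1):
    accum += grad(w, x)       # like loss.backward(): gradients add up in .grad
    if step % optimizer_step_interval == 0:
        # one optimizer.step() using the averaged gradient, then zero_grad()
        w -= lr * accum / optimizer_step_interval
        accum = 0.0

print(w)  # parameter after one accumulated update
```

The division by `optimizer_step_interval` keeps the update comparable to a single large-batch step; in practice this is often done by scaling the loss before `backward()` instead.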
Hi, thanks for releasing the code. I'm trying to use DistributedDataParallel (DDP) to train the model on YouTube-VIS, but I find that the batch size is 1 on a single GPU. DDP is of little use when the batch size is 1. The batch size is actually `max_per_gpu`, which is 1 as shown in the picture below. So I'd like to know how many nodes and what batch size you used in your experiments.