Closed EDENpraseHAZARD closed 3 years ago
Hi, since we didn't have the hardware to train on several GPUs, we simulate larger batch sizes by accumulating the gradients over multiple training iterations (note that this is not exactly the same as training with a larger batch size, but that's a separate issue). In the above code, the gradients will be accumulated `optimizer_step_interval` times before calling `optimizer.step()`. Also keep in mind that under DDP, this code will be executed in parallel in multiple processes, so even though the batch size given to the data loader is 1, the overall effective batch size is 1 × `self.num_gpus` × `optimizer_step_interval`.
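To illustrate the accumulation scheme described above, here is a minimal framework-free sketch. The toy loss, learning rate, and data are illustrative assumptions; only the name `optimizer_step_interval` comes from the issue. The pattern mirrors PyTorch, where `backward()` adds into `.grad` until the optimizer steps and the gradients are zeroed:

```python
optimizer_step_interval = 4   # accumulate over this many micro-batches
lr = 0.1
w = 2.0                       # single scalar "parameter" (toy model)

micro_batches = [1.0, 2.0, 3.0, 4.0]   # toy inputs, one sample each


def grad(w, x):
    # gradient of the toy loss L = 0.5 * (w * x)^2 with respect to w
    return (w * x) * x


accum = 0.0
for step, x in enumerate(micro_batches, start=1):
    accum += grad(w, x)       # like loss.backward(): gradients add up in .grad
    if step % optimizer_step_interval == 0:
        # one optimizer.step() using the averaged gradient, then zero_grad()
        w -= lr * accum / optimizer_step_interval
        accum = 0.0

print(w)  # parameter after one accumulated update
```

The division by `optimizer_step_interval` keeps the update comparable to a single large-batch step; in practice this is often done by scaling the loss before `backward()` instead.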
Hi, thanks for releasing the code. I'm trying to use DistributedDataParallel (DDP) to train the model on YouTube-VIS, but I find that the batch size is 1 on a single GPU. DDP is of little use when the batch size is 1. The batch size is actually `max_per_gpu`, which is 1 as shown in the picture below. So I'd like to know how many nodes and what batch size you used in your experiments.