How long does it take to train on a single GPU? Kindly provide the details of the GPU that you used. Moreover, what was the batch size that you were able to fit in a GPU?
We used multiple NVIDIA V100 GPUs (32 GB VRAM). Specifically, four V100s were used for ViDT-nano and eight V100s for the other models. The training time for each model is as follows:
In all cases, the total batch size was 16. However, the number of images each GPU holds differs by model size: 16/4 = 4 images per GPU in the four-GPU setup (ViDT-nano only), and 16/8 = 2 images per GPU in the eight-GPU setup (ViDT-tiny, small, and base).
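The arithmetic above can be sketched as a small helper; the function name is hypothetical, not part of the ViDT codebase.

```python
# Hypothetical helper showing how the per-GPU batch size follows from
# the total batch size and the number of GPUs, as described above.
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    assert total_batch_size % num_gpus == 0, "total batch must divide evenly"
    return total_batch_size // num_gpus

print(per_gpu_batch_size(16, 4))  # ViDT-nano setup -> 4
print(per_gpu_batch_size(16, 8))  # ViDT-tiny/small/base setup -> 2
```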
Thanks,
Were you only able to fit 4 images on a GPU with 32 GB of memory (for ViDT-nano), or did you simply choose a total batch size of 16?
We could use a larger batch size for ViDT-nano; we chose 16 for consistency across models. You could probably use a batch size of 32, but you would also need to increase the learning rate accordingly (×2 or ×√2).
Thanks for providing the important details. I am curious why we need to increase the learning rate if we increase the batch size?
There is an important paper [1] showing that increasing the batch size has an effect similar to decaying the learning rate, so keeping the learning rate fixed while enlarging the batch slows down convergence. To compensate, the learning rate should be increased by the same (or a similar) factor as the batch size.
[1] Don't Decay the Learning Rate, Increase the Batch Size, ICLR 2018
Quoting the linear scaling rule from a Facebook paper: "When the minibatch size is multiplied by k, multiply the learning rate by k." With this setting you get the same result with faster training (since the batch size is increased).
For more information, see the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
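The two scaling heuristics mentioned in this thread (linear scaling by k, and the alternative √k scaling) can be sketched as below; `base_lr` and `base_batch` are illustrative values, not the ViDT defaults.

```python
import math

# Sketch of the learning-rate scaling heuristics discussed above.
# When the batch size grows from base_batch to new_batch by a factor
# k = new_batch / base_batch, the learning rate is scaled up as well.
def scaled_lr(base_lr: float, base_batch: int, new_batch: int,
              rule: str = "linear") -> float:
    k = new_batch / base_batch
    if rule == "linear":          # Goyal et al.: multiply LR by k
        return base_lr * k
    elif rule == "sqrt":          # alternative: multiply LR by sqrt(k)
        return base_lr * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule}")

print(scaled_lr(1e-4, 16, 32, "linear"))  # doubled batch -> doubled LR
print(scaled_lr(1e-4, 16, 32, "sqrt"))    # doubled batch -> LR x sqrt(2)
```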
Hi, could you try this command? 'n_iter_to_acc' is the step size for gradient accumulation; with it, the model is updated every 8 iterations using accumulated gradients.
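The gradient-accumulation idea can be sketched as below. This is a minimal hypothetical loop, not the actual ViDT training script; the model, data, and hyperparameters are placeholders.

```python
import torch

# Minimal sketch of gradient accumulation. With n_iter_to_acc = 8,
# gradients are accumulated over 8 iterations before the optimizer
# steps once, so the effective batch size is 8 x the per-iteration batch.
n_iter_to_acc = 8
model = torch.nn.Linear(4, 1)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

updates = 0
optimizer.zero_grad()
for it in range(32):
    x, y = torch.randn(2, 4), torch.randn(2, 1)   # dummy mini-batch
    loss = loss_fn(model(x), y) / n_iter_to_acc   # scale so summed grads average
    loss.backward()                               # grads accumulate in .grad
    if (it + 1) % n_iter_to_acc == 0:
        optimizer.step()                          # model update every 8 iterations
        optimizer.zero_grad()
        updates += 1
```

Dividing the loss by `n_iter_to_acc` keeps the accumulated gradient equal to the average over the effective batch, matching what a single large batch would produce.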