How long does it take to train on a single GPU? Kindly provide the details of the GPU that you used. Moreover, what was the batch size that you were able to fit in a GPU?
We used multiple NVIDIA V100 GPUs (32 GB VRAM). Specifically, four V100s were used for ViDT-nano and eight V100s for the other models. The training time for each model is as follows:
In all cases, the total batch size was 16. However, the number of images each GPU holds differs by model size: 16/4 = 4 images per GPU in the four-GPU setup (ViDT-nano only), and 16/8 = 2 images per GPU in the eight-GPU setup (ViDT-tiny, small, and base).
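The arithmetic above can be sketched as a small helper; the function name is hypothetical, not part of the ViDT codebase.

```python
# Hypothetical helper showing how the per-GPU batch size follows from
# the total batch size and the number of GPUs, as described above.
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    assert total_batch_size % num_gpus == 0, "total batch must divide evenly"
    return total_batch_size // num_gpus

print(per_gpu_batch_size(16, 4))  # ViDT-nano setup -> 4
print(per_gpu_batch_size(16, 8))  # ViDT-tiny/small/base setup -> 2
```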
Thanks,
Were you only able to fit 4 images on a GPU with 32 GB of memory (for ViDT-nano), or did you simply choose a total batch size of 16?
We could use a larger batch size for ViDT-nano; we chose 16 for consistency across models. You could probably use a batch size of 32, but you would also need to increase the learning rate accordingly (×2 or ×√2).
Thanks for providing the important details. I am curious why we need to increase the learning rate if we increase the batch size?
There is an important paper [1] showing that increasing the batch size has an effect similar to decaying the learning rate, so keeping the learning rate fixed while enlarging the batch slows down convergence. To compensate, the learning rate should be increased by the same (or a similar) factor as the batch size.
[1] Don't Decay the Learning Rate, Increase the Batch Size, ICLR 2018
Quoting the linear scaling rule from a Facebook paper: "When the minibatch size is multiplied by k, multiply the learning rate by k." With this setting you get the same result with faster training (since the batch size is increased).
For more information, see the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
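The two scaling heuristics mentioned in this thread (linear scaling by k, and the alternative √k scaling) can be sketched as below; `base_lr` and `base_batch` are illustrative values, not the ViDT defaults.

```python
import math

# Sketch of the learning-rate scaling heuristics discussed above.
# When the batch size grows from base_batch to new_batch by a factor
# k = new_batch / base_batch, the learning rate is scaled up as well.
def scaled_lr(base_lr: float, base_batch: int, new_batch: int,
              rule: str = "linear") -> float:
    k = new_batch / base_batch
    if rule == "linear":          # Goyal et al.: multiply LR by k
        return base_lr * k
    elif rule == "sqrt":          # alternative: multiply LR by sqrt(k)
        return base_lr * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule}")

print(scaled_lr(1e-4, 16, 32, "linear"))  # doubled batch -> doubled LR
print(scaled_lr(1e-4, 16, 32, "sqrt"))    # doubled batch -> LR x sqrt(2)
```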
Hi, could you try this command? 'n_iter_to_acc' is the step size for gradient accumulation; with it, the model is updated every 8 iterations using accumulated gradients.
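The gradient-accumulation idea can be sketched as below. This is a minimal hypothetical loop, not the actual ViDT training script; the model, data, and hyperparameters are placeholders.

```python
import torch

# Minimal sketch of gradient accumulation. With n_iter_to_acc = 8,
# gradients are accumulated over 8 iterations before the optimizer
# steps once, so the effective batch size is 8 x the per-iteration batch.
n_iter_to_acc = 8
model = torch.nn.Linear(4, 1)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

updates = 0
optimizer.zero_grad()
for it in range(32):
    x, y = torch.randn(2, 4), torch.randn(2, 1)   # dummy mini-batch
    loss = loss_fn(model(x), y) / n_iter_to_acc   # scale so summed grads average
    loss.backward()                               # grads accumulate in .grad
    if (it + 1) % n_iter_to_acc == 0:
        optimizer.step()                          # model update every 8 iterations
        optimizer.zero_grad()
        updates += 1
```

Dividing the loss by `n_iter_to_acc` keeps the accumulated gradient equal to the average over the effective batch, matching what a single large batch would produce.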