scyonggg closed this issue 10 months ago
> Thank you for your work, I'm really interested in your model. I've tried to reproduce your results, especially biformer-tiny, but I was not able to get the same accuracy as in the paper. Since I don't have a Slurm cluster server, I trained on my local GPU machine with the following script:
>
> ```
> python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
>     --data-path ./data/in1k \
>     --model 'biformer_tiny' \
>     --output_dir './outputs' \
>     --input-size 224 \
>     --batch-size 128 \
>     --drop-path 0.1 \
>     --lr 5e-4 \
>     --dist-eval
> ```

This looks fine. I'm not sure whether the 0.1 performance drop (81.37 -> 81.26) is just normal fluctuation, because I only ran the experiment once. Try --batch-size 256 (hence an effective batch size of 256 x 8 = 2048) as well.

> I have tried several experiments, such as changing the lr from 5e-4 to 1e-3 (as mentioned in your paper) and using a 1024 batch size with different numbers of GPUs, e.g. 256 batch/GPU on 4 GPUs or 128 batch/GPU on 8 GPUs. However, the result I got was only 81.26%, failing to reproduce the 81.4% in your paper (probably 81.37% according to your log, is that right?)

In PyTorch Distributed Data Parallel (DDP), it is the effective batch size (i.e. number_of_GPUs x per_gpu_batch_size) that matters, not per_gpu_batch_size, so never waste your time on these trials. Also note that the lr is automatically linearly scaled (see main.py), so you should use --lr 5e-4.
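For reference, the scaling in main.py presumably follows the usual DeiT recipe; a minimal sketch, assuming a base batch size of 512 and illustrative variable names (not verbatim from the repo):

```python
# Sketch of DeiT-style linear lr scaling, assuming a base batch of 512;
# names are illustrative, not copied from main.py.
base_lr = 5e-4        # value passed via --lr
per_gpu_batch = 128   # value passed via --batch-size
num_gpus = 8          # --nproc_per_node

effective_batch = per_gpu_batch * num_gpus     # 128 * 8 = 1024
scaled_lr = base_lr * effective_batch / 512.0  # 5e-4 * 1024 / 512 = 1e-3
print(scaled_lr)  # 0.001
```

Under this convention, --lr 5e-4 at an effective batch of 1024 already corresponds to an actual rate of 1e-3, so additionally passing --lr 1e-3 would double it.
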
> Could you please share the scripts used to train the biformer-tiny, small, and base models? It doesn't matter whether the script is based on hydra_main or slurm. Thank you.

You can use --batch-size 256 --drop-path 0.4 --lr 5e-4 on your local machine with 8 GPUs if your GPU memory is large enough (e.g. an A100 80G).
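For concreteness, here is the earlier launch command with those flags swapped in (a sketch only: the data path, model name, and output dir are carried over from the script above, so adjust them to whichever model you are training):

```
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --data-path ./data/in1k \
    --model 'biformer_tiny' \
    --output_dir './outputs' \
    --input-size 224 \
    --batch-size 256 \
    --drop-path 0.4 \
    --lr 5e-4 \
    --dist-eval
```
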
Thank you for the reply. I thought the model was trained with a 1024 batch size, as described in the paper.
Is there any reason for training the model with a batch size of 2048 while several vision backbone models were trained with a batch size of 1024?
No. At least for me, it is just a hardware issue (speed, memory, etc.). I'm not sure if it has an impact on final performance; it is too expensive to try everything.
But some literature suggests that the batch size should not affect performance once the learning rate is linearly scaled; see "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (https://arxiv.org/pdf/1706.02677.pdf) for details.
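The rule from that paper is simple proportional scaling; a minimal sketch (the helper name is mine, for illustration):

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Linear scaling rule (Goyal et al., 2017): when the minibatch size is
    # multiplied by k, multiply the learning rate by k.
    return base_lr * new_batch / base_batch

# e.g. doubling the effective batch from 1024 to 2048 doubles the lr
print(scaled_lr(1e-3, 1024, 2048))  # 0.002
```
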
Thanks, I've got the answer to my question. Let me close the issue.