rayleizhu / BiFormer

[CVPR 2023] Official code release of our paper "BiFormer: Vision Transformer with Bi-Level Routing Attention"
https://arxiv.org/abs/2303.08810
MIT License

Question about reproducing results of biformer-tiny on ImageNet1K classification #32

Closed: scyonggg closed this issue 10 months ago

scyonggg commented 10 months ago

Thank you for your work; I'm really interested in your model.

I've tried to reproduce your results, especially for biformer-tiny, but I was not able to reach the accuracy reported in the paper.

Since I don't have a Slurm cluster, I trained on my local GPU machine with the following script:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
--data-path ./data/in1k \
--model 'biformer_tiny' \
--output_dir './outputs' \
--input-size 224 \
--batch-size 128 \
--drop-path 0.1 \
--lr 5e-4 \
--dist-eval

I have tried several variations, such as changing the lr from 5e-4 to 1e-3 as mentioned in your paper, and keeping a total batch size of 1024 with different numbers of GPUs (256 per GPU on 4 GPUs, or 128 per GPU on 8 GPUs).

However, the best result I got was only 81.26%, which fails to reproduce the 81.4% reported in your paper (81.37% according to your log, is that right?).

Could you please share the scripts used to train the biformer-tiny, -small, and -base models? It doesn't matter whether they are based on hydra_main or slurm.

Thank you.

rayleizhu commented 10 months ago

Since I don't have a Slurm cluster, I trained on my local GPU machine with the following script:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
--data-path ./data/in1k \
--model 'biformer_tiny' \
--output_dir './outputs' \
--input-size 224 \
--batch-size 128 \
--drop-path 0.1 \
--lr 5e-4 \
--dist-eval

This looks fine. I'm not sure whether the 0.1 drop (81.37 -> 81.26) is just a normal run-to-run fluctuation, because I only ran the experiment once. You could also try batch_size=256 per GPU (so the effective batch size becomes 256 x 8 = 2048).

I have tried several variations, such as changing the lr from 5e-4 to 1e-3 as mentioned in your paper, and keeping a total batch size of 1024 with different numbers of GPUs (256 per GPU on 4 GPUs, or 128 per GPU on 8 GPUs).

In PyTorch Distributed Data Parallel (DDP), what matters is the effective batch size (i.e., num_GPUs x per_gpu_batch_size), not the per-GPU batch size, so don't waste time on those trials. Also note that the lr is automatically linearly scaled (see main.py), so you should keep --lr 5e-4.
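
For reference, the scaling in main.py follows the usual DeiT-style convention, roughly like the sketch below (the base divisor of 512 is the DeiT default and is an assumption here; check main.py for the exact constant):

# Sketch of DeiT-style linear lr scaling (divisor 512 assumed; see main.py for the exact value).
def scaled_lr(base_lr: float, per_gpu_batch: int, num_gpus: int) -> float:
    effective_batch = per_gpu_batch * num_gpus
    return base_lr * effective_batch / 512.0

# --lr 5e-4 with 128 per GPU on 8 GPUs -> effective batch 1024 -> lr 1e-3,
# which matches the 1e-3 mentioned in the paper.
print(scaled_lr(5e-4, 128, 8))  # 0.001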

Could you please share the scripts used to train the biformer-tiny, -small, and -base models? It doesn't matter whether they are based on hydra_main or slurm.

scyonggg commented 10 months ago

Thank you for the reply. I thought the model was trained with a batch size of 1024, as described in the paper.

Is there any reason for training the model with a batch size of 2048, while several other vision backbones were trained with a batch size of 1024?

rayleizhu commented 10 months ago

No. At least for me, it was just a hardware matter (speed, memory, etc.). I'm not sure whether it has an impact on final performance; it is too expensive to try everything.

But some literature suggests that batch size should not affect performance once the learning rate is linearly scaled; see "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (https://arxiv.org/pdf/1706.02677.pdf) for details.
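
Concretely, that rule just keeps the lr proportional to the effective batch size. A quick illustrative check, taking the 1e-3 at effective batch 1024 setting from the paper as the reference point (an assumption for the example):

# Linear scaling rule (Goyal et al., 2017): lr is proportional to the effective batch size.
ref_lr, ref_batch = 1e-3, 1024  # reference point taken from the paper's setting
for batch in (1024, 2048):
    print(batch, ref_lr * batch / ref_batch)  # 1024 -> 1e-3, 2048 -> 2e-3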


scyonggg commented 10 months ago

Thanks, I've got the answer to my question. Let me close the issue.