microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

Can you share the multi-GPU training command? #7

Closed hpppppp8 closed 2 years ago

hpppppp8 commented 2 years ago

I saw you provide the single-GPU training command and ran it successfully, but I ran into some trouble with multi-GPU training. Could you provide the multi-GPU training command, e.g. for the MSRVTT dataset? Thanks for your work!

kevinlin311tw commented 2 years ago

Thanks for the question. Our code supports multi-GPU training, and there are several possible ways to launch it.

Assuming you have a machine with 2 GPUs, you could try the example command below:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
        src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/msrvtt_8frm_default.json \
        --train_yaml MSRVTT-v2/train_32frames.yaml \
        --val_yaml MSRVTT-v2/val_32frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 15 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 4 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output
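
(For reference, and assuming the usual definition, the effective batch size with these flags would be 2 GPUs × 6 samples per GPU × 4 gradient-accumulation steps = 48 samples per optimizer step.)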

You can also try the equivalent mpirun command:

CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 python \
        src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/msrvtt_8frm_default.json \
        --train_yaml MSRVTT-v2/train_32frames.yaml \
        --val_yaml MSRVTT-v2/val_32frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 15 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 4 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output
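
Both launchers ultimately work by exporting the process-group environment variables (RANK, WORLD_SIZE, LOCAL_RANK, ...) that the training script and DeepSpeed read at start-up; mpirun instead exports the Open MPI OMPI_COMM_WORLD_* equivalents. A minimal sketch of that mapping (the helper name is hypothetical; the repo has its own distributed-setup utilities):

    import os
    import torch.distributed as dist

    def init_process_group_from_env():
        # torch.distributed.launch exports RANK / WORLD_SIZE / LOCAL_RANK directly;
        # under mpirun we map the Open MPI equivalents onto the same names.
        if "OMPI_COMM_WORLD_SIZE" in os.environ and "RANK" not in os.environ:
            os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
            os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
            os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
            os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
            os.environ.setdefault("MASTER_PORT", "29500")
        # With no explicit init_method, init_process_group uses the env://
        # rendezvous, i.e. it reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
        dist.init_process_group(backend="nccl")
        return dist.get_rank(), dist.get_world_size()
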
hpppppp8 commented 2 years ago

Thank you for your reply! It works!

kevinlin311tw commented 2 years ago

Closing as it works for now.

tiesanguaixia commented 1 year ago

Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the values of 'loss_sparse_w' and 'learning_rate' in the command? I guess 'loss_sparse_w' is the regularization hyperparameter on $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$, so why is the default value of 'loss_sparse_w' in the command 0.5? Do I need to adjust it to 5? Thank you a lot!
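
If loss_sparse_w is indeed the $\lambda$ from the paper, the overall objective would presumably be the captioning cross-entropy plus the weighted sparsity regularizer, i.e. something like

    $Loss_{TOTAL} = Loss_{CAPTION} + \lambda \cdot Loss_{SPARSE}$

(a sketch based on the paper's description, not read off the code), so the flag would directly scale the sparsity term.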

ybsu commented 1 year ago

Hello, when running the training command for the VATEX part I got the error below. I searched online and found that manually assigning a value to os.environ['RANK'] skips this error, but then it fails later with a KeyError on os.environ['WORLD_SIZE']. I suspect this problem is not trivial and I cannot figure it out, so I would like to ask for your advice; getting the program to run is the first step. Thank you very much.

File "src/tasks/run_caption_VidSwinBert.py", line 689, in main(args) File "src/tasks/run_caption_VidSwinBert.py", line 675, in main args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer) File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precisioninit model, optimizer, , _ = deepspeed.initialize( File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/init.py", line 129, in initialize dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required) File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method) File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend rank = int(os.environ["RANK"]) File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in getitem raise KeyError(key) from None KeyError: 'RANK'