Closed: hpppppp8 closed this 2 years ago
Thanks for the question. Our code supports multi-GPU training, and there are several ways to launch it.
On a machine with 2 GPUs, for example, you could try the command below:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
    src/tasks/run_caption_VidSwinBert.py \
    --config src/configs/VidSwinBert/msrvtt_8frm_default.json \
    --train_yaml MSRVTT-v2/train_32frames.yaml \
    --val_yaml MSRVTT-v2/val_32frames.yaml \
    --per_gpu_train_batch_size 6 \
    --per_gpu_eval_batch_size 6 \
    --num_train_epochs 15 \
    --learning_rate 0.0003 \
    --max_num_frames 32 \
    --pretrained_2d 0 \
    --backbone_coef_lr 0.05 \
    --mask_prob 0.5 \
    --max_masked_token 45 \
    --zero_opt_stage 1 \
    --mixed_precision_method deepspeed \
    --deepspeed_fp16 \
    --gradient_accumulation_steps 4 \
    --learn_mask_enabled \
    --loss_sparse_w 0.5 \
    --output_dir ./output
You can also try this:
CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 \
    src/tasks/run_caption_VidSwinBert.py \
    --config src/configs/VidSwinBert/msrvtt_8frm_default.json \
    --train_yaml MSRVTT-v2/train_32frames.yaml \
    --val_yaml MSRVTT-v2/val_32frames.yaml \
    --per_gpu_train_batch_size 6 \
    --per_gpu_eval_batch_size 6 \
    --num_train_epochs 15 \
    --learning_rate 0.0003 \
    --max_num_frames 32 \
    --pretrained_2d 0 \
    --backbone_coef_lr 0.05 \
    --mask_prob 0.5 \
    --max_masked_token 45 \
    --zero_opt_stage 1 \
    --mixed_precision_method deepspeed \
    --deepspeed_fp16 \
    --gradient_accumulation_steps 4 \
    --learn_mask_enabled \
    --loss_sparse_w 0.5 \
    --output_dir ./output
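A side note on the batch-size flags in these commands: under ordinary data-parallel training, the number of samples behind each optimizer step is usually the per-GPU batch size times the number of GPUs times the gradient accumulation steps. The sketch below only does that arithmetic. The helper name is hypothetical, and whether this repository combines the flags in exactly this way is an assumption, not something confirmed in this thread.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step, assuming plain data parallelism.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    return per_gpu_batch_size * num_gpus * grad_accum_steps


# Values from the commands in this thread:
# --per_gpu_train_batch_size 6, 2 GPUs, --gradient_accumulation_steps 4
print(effective_batch_size(6, 2, 4))  # 48
```

If that assumption holds, changing the number of GPUs changes the effective batch size, so the learning rate or the accumulation steps may need to be revisited when moving between single-GPU and multi-GPU runs.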
Thank you for your reply! It works!
Closing as it works for now.
I saw that you provide the single-GPU training command, and I ran it successfully. However, I had some trouble with multi-GPU training. Could you provide a multi-GPU training command, for example on the MSRVTT dataset? Thanks for your work!
Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the values of 'loss_sparse_w' and 'learning_rate' in the command? I guess 'loss_sparse_w' is the regularization hyperparameter of $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$. But why is the default value of 'loss_sparse_w' in the command 0.5? Do I need to change it to 5? Thank you a lot!
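Hello, I am running the training command for the VATEX part and got the error below. I searched online and found that manually assigning a value to os.environ['RANK'] gets past this error, but then a later error appears: a KeyError for os.environ['WORLD_SIZE']. I suspect the problem is not that simple and I can't figure it out, so I am asking for your help. Getting the program to run is the first step. Thank you very much.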
File "src/tasks/run_caption_VidSwinBert.py", line 689, in <module>
    main(args)
File "src/tasks/run_caption_VidSwinBert.py", line 675, in main
    args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer)
File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precision_init
    model, optimizer, _, _ = deepspeed.initialize(
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/__init__.py", line 129, in initialize
    dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
    init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
    rank = int(os.environ["RANK"])
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
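The traceback shows the script reading RANK from the environment, which a distributed launcher normally sets for each process. A likely cause, though not confirmed in this thread, is that the script was started with plain `python` rather than through a launcher such as `torch.distributed.launch`. The sketch below is one way to check which variables are missing before calling `deepspeed.initialize`. The function names are hypothetical, the variable list follows what torch.distributed's environment-variable initialization expects, and the fallback values (including the port) are arbitrary choices for a one-process debugging run, not settings from this repository.

```python
import os

# Variables read by torch.distributed's env:// initialization.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_distributed_vars(env=os.environ):
    """Return the launcher-provided variables that are absent from env."""
    return [name for name in REQUIRED_VARS if name not in env]


def single_process_defaults(env):
    """Fill in values describing a one-process 'world'.

    Hypothetical fallback for debugging on a single GPU; values already
    present (for example, set by a real launcher) are left untouched.
    """
    defaults = {
        "RANK": "0",
        "WORLD_SIZE": "1",
        "LOCAL_RANK": "0",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",  # arbitrary free port, an assumption
    }
    for name, value in defaults.items():
        env.setdefault(name, value)
    return env
```

Setting the variables one by one, as tried above, tends to surface the next missing key each time. Starting the script through a launcher avoids this because the launcher sets all of them per process.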