microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

How to run in distributed mode for BEiT? #671

Closed Zhaoyi-Yan closed 2 years ago

Zhaoyi-Yan commented 2 years ago

Describe: I am not sure how to run BEiT in distributed mode successfully. Drawing on my experience running the MoCo v3 training script (https://github.com/facebookresearch/moco-v3),

I wrote two scripts:

run_pretrain_21k_m1.sh

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_beit_pretraining.py --world_size 2 --local_rank 0 --dist_url "tcp://${MASTER_IP}:23456" \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
        --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
        --batch_size 28 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
        --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1

run_pretrain_21k_m2.sh, which differs from run_pretrain_21k_m1.sh only in --local_rank:

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_beit_pretraining.py --world_size 2 --local_rank 1 --dist_url "tcp://${MASTER_IP}:23456" \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
        --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
        --batch_size 28 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
        --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1

Then I manually wrote the following commands for the two subtasks:

cd /userhome/yzy/Beit/unilm/beit; bash master_ip.sh; bash /userhome/basic_1.sh; bash run_pretrain_21k_m1.sh 0 >logs/beit_base_21k.txt 2>&1
cd /userhome/yzy/Beit/unilm/beit; bash /userhome/basic_1.sh; sleep 1m; bash run_pretrain_21k_m2.sh 1

However, an error occurred:

  File "run_beit_pretraining.py", line 230, in main
    warmup_epochs=args.warmup_epochs, warmup_steps=args.warmup_steps,
  File "/userhome/yzy/Beit/unilm/beit/utils.py", line 399, in cosine_scheduler
    main(opts)
  File "run_beit_pretraining.py", line 230, in main
    warmup_epochs=args.warmup_epochs, warmup_steps=args.warmup_steps,
  File "/userhome/yzy/Beit/unilm/beit/utils.py", line 399, in cosine_scheduler
addf400 commented 2 years ago

Hi @Zhaoyi-Yan, here are more details on torch.distributed.launch:

python -m torch.distributed.launch --help
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK] [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR] [--master_port MASTER_PORT] [--use_env] [-m] [--no_python] training_script

Please use --node_rank instead of --local_rank and adjust your script:


OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node 16 --node_rank $NODE_RANK  --master_addr $MASTER_IP --master_port $MASTER_PORT run_beit_pretraining.py  --dist_url "tcp://${MASTER_IP}:23456" \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
        --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
        --batch_size 28 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
        --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1
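To see why --node_rank is the right flag here, a minimal sketch of the rank arithmetic that torch.distributed.launch performs may help: each node spawns nproc_per_node worker processes, and every worker's global rank is derived from the node's --node_rank. The function name below is illustrative, not part of PyTorch's API.

```python
def global_ranks(node_rank: int, nproc_per_node: int) -> list:
    """Global ranks assigned to the workers spawned on one node.

    torch.distributed.launch computes each worker's RANK as
    node_rank * nproc_per_node + local_rank, so --node_rank must
    differ per machine while --nproc_per_node stays the same.
    """
    return [node_rank * nproc_per_node + local_rank
            for local_rank in range(nproc_per_node)]

# Two nodes with 16 GPUs each, as in the scripts above:
print(global_ranks(0, 16))  # node with --node_rank 0 gets ranks 0..15
print(global_ranks(1, 16))  # node with --node_rank 1 gets ranks 16..31
```

Passing --local_rank 0 or 1 at the command line (as in the original scripts) conflicts with this scheme, because the launcher itself supplies a distinct local rank to each of the 16 processes it spawns per node.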
Zhaoyi-Yan commented 2 years ago

It works, thank you! Maybe it's a good idea to add these tips to the README of BEiT.