Closed: Zhaoyi-Yan closed this issue 2 years ago.
Hi @Zhaoyi-Yan, for more details on torch.distributed.launch, run:
python -m torch.distributed.launch --help
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK] [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR] [--master_port MASTER_PORT] [--use_env] [-m] [--no_python] training_script
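Here --nnodes is the number of machines, --node_rank is the index of the current machine, and --nproc_per_node is the number of processes (typically one per GPU) started on that machine. The launcher itself supplies each worker's local rank (as a --local_rank argument, or as the LOCAL_RANK environment variable when --use_env is given), so it does not need to be set by hand. A minimal sketch, assuming a hypothetical two-machine job; the script name, address, and port below are placeholders:

# on machine 0 (placeholder values)
python -m torch.distributed.launch --nnodes 2 --node_rank 0 --nproc_per_node 8 \
    --master_addr 10.0.0.1 --master_port 29500 train.py
# on machine 1, only --node_rank changes
python -m torch.distributed.launch --nnodes 2 --node_rank 1 --nproc_per_node 8 \
    --master_addr 10.0.0.1 --master_port 29500 train.py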
Please use --node_rank instead of --local_rank, and adjust your script accordingly:
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node 16 --node_rank $NODE_RANK --master_addr $MASTER_IP --master_port $MASTER_PORT run_beit_pretraining.py --dist_url "tcp://${MASTER_IP}:23456" \
--data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --num_mask_patches 75 \
--model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
--batch_size 28 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
--clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1
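The only per-machine difference in the command above is $NODE_RANK. A minimal sketch of how it might be set, assuming a two-machine job in which DATA_PATH, OUTPUT_DIR, TOKENIZER_PATH, MASTER_IP, and MASTER_PORT are exported identically on both machines (and --nnodes is set to the number of machines):

# on the first machine
export NODE_RANK=0
# on the second machine
export NODE_RANK=1
# then run the same OMP_NUM_THREADS=1 python -m torch.distributed.launch ... command on each machine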
It works, thank you! Maybe it's a good idea to add these tips to the README of BEiT.
Describe: I am not sure how to successfully run BEiT in distributed mode. Based on the experience I gained when running the training script of MoCo v3 (https://github.com/facebookresearch/moco-v3), I wrote two scripts:
run_pretrain_21k_m1.sh
run_pretrain_21k_m2.sh
The first script differs from run_pretrain_21k_m2.sh in local_rank; we then manually wrote the following code for the two subtasks.
However, an error occurred: