Open xiaoyi0814 opened 1 year ago
I have the same issue
I want to train a 30B LLaMA model on 16 A100 GPUs (2 nodes × 8 cards). I followed the Transformers documentation and used the script below to start training. It launches the run, but it seems to be slower than a single node with 8 cards.
The logs from the 2-node trial look like:
[2023-04-21 19:36:50,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=13120, skipped=223, lr=[1.9668084143292992e-05], mom=[(0.9, 0.95)]
[2023-04-21 19:36:50,424] [INFO] [timer.py:199:stop] epoch=0/micro_step=52480/global_step=13120, RunningAvgSamplesPerSec=4.5431656108358975, CurrSamplesPerSec=4.584817867742811, MemAllocated=31.36GB, MaxMemAllocated=36.97GB
The logs from the 1-node trial look like:
[2023-04-21 07:14:26,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=19180, skipped=346, lr=[1.6947090538430577e-08], mom=[(0.9, 0.95)]
[2023-04-21 07:14:26,161] [INFO] [timer.py:199:stop] epoch=1/micro_step=38352/global_step=19180, RunningAvgSamplesPerSec=2.8659832077822576, CurrSamplesPerSec=2.9150100342807246, MemAllocated=61.66GB, MaxMemAllocated=67.26GB
CurrSamplesPerSec is higher in the 2-node trial than in the 1-node trial, yet the 2-node trial has been running for much longer and still has not finished the first epoch. The learning rate also does not seem to decrease during training, and I don't know what is wrong. (The pink line is the 1-node trial, the brown line is the 2-node trial.)
The script is below.
ARNOLD_WORKER_GPU=8
ARNOLD_WORKER_NUM=2
ARNOLD_WORKER_0_HOST=<work0_ip>
port=<work0_master_port>
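# NOTE: this same torchrun command is executed on every node; the c10d
# rendezvous at $ARNOLD_WORKER_0_HOST:$port coordinates the $ARNOLD_WORKER_NUM nodes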
python3 -m torch.distributed.run \
--nproc_per_node $ARNOLD_WORKER_GPU --nnodes $ARNOLD_WORKER_NUM \
--rdzv_id=$WANDB_PROJECT.$ARNOLD_WORKER_0_HOST \
--rdzv_endpoint=$ARNOLD_WORKER_0_HOST:$port \
--rdzv_backend=c10d \
main.py \
--data_path <data_path> \
--data_split 1,0,0 \
--model_name_or_path <model_path> \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--max_seq_len 1024 \
--learning_rate 2e-5 \
--weight_decay 0.1 \
--num_train_epochs 2 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--num_warmup_steps 300 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage $ZERO_STAGE \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
I ran the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh. But it seems the multiple nodes are not launched successfully, and the log shows the warning below:
[2023-04-21 03:19:45,810] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 03:19:52,167] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-21 03:19:52,167] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-21 03:19:52,167] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-21 03:19:52,167] [INFO] [launch.py:247:main] dist_world_size=4
[2023-04-21 03:19:52,167] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
However, I can't get the IP of each node before starting the training: the GPUs are only allocated when the task starts, so I can't write the hostfile in advance. How can I use multi-node training the way torch does, by just passing master_addr and master_port?
Thanks!
I found the solution. This code is mainly written to be launched with the deepspeed command. You can simply use torchrun (like the script I posted above) to launch the training instead, but you need to add one line after args = parse_args():
args.local_rank = int(os.environ.get('LOCAL_RANK', -1))
This is because torchrun passes the local rank through the LOCAL_RANK environment variable rather than the --local_rank argument, so local_rank is never initialized, which may cause unexpected behavior when creating the DataSampler.
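In context, the change looks roughly like this (a minimal sketch; the argparse setup here is only a stand-in for the real parse_args() in main.py, which defines many more options):
import argparse
import os

def parse_args():
    # stand-in for the real parse_args() in main.py
    parser = argparse.ArgumentParser()
    # the deepspeed launcher injects --local_rank; torchrun does not
    parser.add_argument('--local_rank', type=int, default=-1)
    return parser.parse_args()

args = parse_args()
# torchrun exports the local rank via the LOCAL_RANK environment variable,
# so fall back to it here; the -1 default keeps single-process runs working
args.local_rank = int(os.environ.get('LOCAL_RANK', -1))
With that line in place, each process should pick up its correct local rank under torchrun, so device placement and the data sampler behave as intended.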
Same issue here. How do you train in a multi-node setting when launching with deepspeed?
Any updates on multi-node training?
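If you want to stay with the deepspeed launcher rather than torchrun, one workaround is to write DeepSpeed's hostfile at job start, once the nodes have been allocated, and pass it via --hostfile. A rough sketch, reusing the ARNOLD_* variable names from the script earlier in this thread (the per-node ARNOLD_WORKER_{i}_HOST variables are an assumption about the cluster):
import os

# Sketch only: build DeepSpeed's hostfile ("<hostname> slots=<gpus>") once the
# scheduler has assigned the nodes; /job/hostfile is DeepSpeed's default path.
num_nodes = int(os.environ['ARNOLD_WORKER_NUM'])
gpus_per_node = int(os.environ['ARNOLD_WORKER_GPU'])

with open('/job/hostfile', 'w') as f:
    for i in range(num_nodes):
        host = os.environ[f'ARNOLD_WORKER_{i}_HOST']
        f.write(f'{host} slots={gpus_per_node}\n')

# then launch from the first node with, e.g.:
#   deepspeed --hostfile=/job/hostfile main.py <training args>
Note that the deepspeed launcher starts the remote processes over SSH (pdsh by default), so the nodes need passwordless SSH to each other; if that is not available, the torchrun route above is the simpler option.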