microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Running multi-node training failed; how to train without a hostfile #381

Open xiaoyi0814 opened 1 year ago

xiaoyi0814 commented 1 year ago

I ran the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh. But it seems that the multiple nodes were not launched successfully, and the log shows the warning below:

[2023-04-21 03:19:45,810] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 03:19:52,167] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-21 03:19:52,167] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-21 03:19:52,167] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-21 03:19:52,167] [INFO] [launch.py:247:main] dist_world_size=4
[2023-04-21 03:19:52,167] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3

However, I can't get the IP of each node before starting the training. The GPUs are only allocated when the task starts, so I can't write the hostfile in advance. How can I run multi-node training the way torch does, by just passing master_addr and master_port?

Thanks!
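For context, the hostfile the warning above refers to is a plain text file with one "<hostname_or_ip> slots=<num_gpus>" line per node, and the deepspeed runner also accepts --master_addr/--master_port. Below is a minimal, hypothetical sketch of one workaround: write the hostfile at job start, once the scheduler has handed out the node IPs, and then call the launcher. The NODE_IPS and GPUS_PER_NODE variable names are made up; substitute whatever your scheduler actually exports, and note that the default pdsh/ssh launcher still needs passwordless SSH between the nodes.

  import os
  import subprocess

  # Made-up environment variables; replace with what your scheduler provides.
  node_ips = os.environ["NODE_IPS"].split(",")          # e.g. "10.0.0.1,10.0.0.2"
  gpus_per_node = int(os.environ.get("GPUS_PER_NODE", "8"))

  # DeepSpeed hostfile format: "<hostname_or_ip> slots=<num_gpus>", one line per node.
  with open("hostfile", "w") as f:
      for ip in node_ips:
          f.write(f"{ip} slots={gpus_per_node}\n")

  # Hand the generated hostfile to the deepspeed runner; worker 0 acts as the master.
  subprocess.run(
      [
          "deepspeed",
          "--hostfile=hostfile",
          "--master_addr", node_ips[0],
          "--master_port", "29500",
          "main.py",
          "--deepspeed",
          # followed by the same main.py arguments used in run_66b.sh
      ],
      check=True,
  )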

muziyongshixin commented 1 year ago

I have the same issue

muziyongshixin commented 1 year ago

I want to train a 30B LLaMA model on 16 A100 GPUs (2 nodes * 8 cards). Following the transformers documentation, I use the script below to start training. It launches the run, but it seems to be slower than a single node with 8 cards.

The 2-node trial logs look like:

[2023-04-21 19:36:50,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=13120, skipped=223, lr=[1.9668084143292992e-05], mom=[(0.9, 0.95)]
[2023-04-21 19:36:50,424] [INFO] [timer.py:199:stop] epoch=0/micro_step=52480/global_step=13120, RunningAvgSamplesPerSec=4.5431656108358975, CurrSamplesPerSec=4.584817867742811, MemAllocated=31.36GB, MaxMemAllocated=36.97GB

The 1-node trial logs look like:

[2023-04-21 07:14:26,157] [INFO] [logging.py:96:log_dist] [Rank 0] step=19180, skipped=346, lr=[1.6947090538430577e-08], mom=[(0.9, 0.95)]
[2023-04-21 07:14:26,161] [INFO] [timer.py:199:stop] epoch=1/micro_step=38352/global_step=19180, RunningAvgSamplesPerSec=2.8659832077822576, CurrSamplesPerSec=2.9150100342807246, MemAllocated=61.66GB, MaxMemAllocated=67.26GB

It seems that the CurrSamplesPerSec of the 2-node trial is higher than that of the 1-node trial, but the 2-node trial has run for much longer and still has not finished the first epoch. Also, the lr does not seem to decrease during training; I don't know what is wrong. (In the attached plot, the pink line is the 1-node trial and the brown line is the 2-node trial.)

The script is below.


  # Cluster-provided settings
  ARNOLD_WORKER_GPU=8              # GPUs per node
  ARNOLD_WORKER_NUM=2              # number of nodes
  ARNOLD_WORKER_0_HOST=<work0_ip>  # IP of worker 0 (rendezvous host)
  port=<work0_master_port>         # rendezvous port on worker 0

  # Launch with torchrun; the same command is run on every node.
  python3 -m torch.distributed.run \
    --nproc_per_node $ARNOLD_WORKER_GPU --nnodes $ARNOLD_WORKER_NUM \
    --rdzv_id=$WANDB_PROJECT.$ARNOLD_WORKER_0_HOST \
    --rdzv_endpoint=$ARNOLD_WORKER_0_HOST:$port \
    --rdzv_backend=c10d \
    main.py \
      --data_path <data_path> \
      --data_split 1,0,0 \
      --model_name_or_path <model_path> \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --max_seq_len 1024 \
      --learning_rate 2e-5 \
      --weight_decay 0.1 \
      --num_train_epochs 2 \
      --gradient_accumulation_steps 4 \
      --lr_scheduler_type cosine \
      --num_warmup_steps 300 \
      --seed 1234 \
      --gradient_checkpointing \
      --zero_stage $ZERO_STAGE \
      --deepspeed \
      --output_dir $OUTPUT \
      &> $OUTPUT/training.log

muziyongshixin commented 1 year ago

(Quoting xiaoyi0814's original post above.)

I found the solution. This codebase is mainly written to be launched with the deepspeed command, but you can simply use torchrun (like the script I posted above) to launch the training. You just need to add one line after args = parse_args(): args.local_rank = int(os.environ.get('LOCAL_RANK', -1)). This is needed because torchrun does not fill in the local_rank argument, which can cause unexpected behavior when the DataSampler is created.
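A minimal sketch of where that line goes, assuming the usual argparse setup in main.py (only the relevant fragment is shown; the real parser defines many more arguments):

  import argparse
  import os

  def parse_args():
      parser = argparse.ArgumentParser()
      # The deepspeed launcher passes --local_rank on the command line, so it
      # defaults to -1 when the script is launched any other way.
      parser.add_argument("--local_rank", type=int, default=-1)
      # ... the rest of the training arguments are omitted in this sketch ...
      return parser.parse_args()

  args = parse_args()
  # torchrun does not pass --local_rank; it exports LOCAL_RANK per process instead,
  # so copy it over before the distributed sampler is created.
  args.local_rank = int(os.environ.get("LOCAL_RANK", -1))

With this change the same torchrun command can be run on every node, and each process picks up its LOCAL_RANK from the environment.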

GeekDream-x commented 11 months ago

Same issue. How can we train in a multi-node setting when launching with deepspeed?

strikinglee commented 10 months ago

Any updates on multi-node training?