microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Multi-node failure with Step 3 RLHF training with GPT-J 6B on 2x8x32GB V100 #3672

Open hiteshis opened 1 year ago

hiteshis commented 1 year ago

Describe the bug
I am not able to run the multi-node Step 3 script with a 6B actor and a 6B critic on 2 nodes of 8 V100 GPUs each on Azure ML. I am running the following command:

```
deepspeed --master_port 29501 main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path EleutherAI/gpt-j-6b \
    --critic_model_name_or_path /mnt/data/ds-chat-step2output/gptj6b \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --generation_batch_numbers 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 50 \
    --max_prompt_seq_len 256 \
    --actor_learning_rate 5e-4 \
    --critic_learning_rate 5e-6 \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --num_warmup_steps 100 \
    --deepspeed_mpi \
    --deepspeed \
    --seed 1234 \
    --enable_hybrid_engine \
    --inference_tp_size 8 \
    --tp_gather_partition_size 4 \
    --actor_zero_stage 3 \
    --critic_zero_stage 3 \
    --actor_gradient_checkpointing \
    --disable_actor_dropout \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir /mnt/data/ds-chat-step3output
```

Log output:

```
RuntimeError: The server socket has failed to listen on any local network address.
The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use).
The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
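To confirm what is holding the port, checking the listeners on each node before launch is a quick sanity check (a generic sketch; any free port works for `--master_port`):

```sh
# Show whatever is already listening on the master port (29501 here).
ss -ltnp | grep 29501

# If the port is taken, either stop that process or launch with a free port:
deepspeed --master_port 29502 main.py ...
```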

To Reproduce Steps to reproduce the behavior:

  1. Install requirements + mpi4py (see the sketch below)
  2. Run the command above
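A minimal sketch of step 1, assuming the DeepSpeed-Chat Step 3 `requirements.txt` (mpi4py installed separately, since the command above passes `--deepspeed_mpi`):

```sh
pip install -r requirements.txt   # DeepSpeed-Chat step 3 requirements
pip install mpi4py                # MPI bindings for the --deepspeed_mpi path
```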

Expected behavior
The Step 3 training run completes end to end.

Screenshots

(two screenshots of the failure were attached to the original issue)

System info (please complete the following information):

  - Setup: 2 nodes, each with 8x 32GB V100 GPUs, running on Azure ML

Earlier, I was using OPT models, but someone reported NCCL communication issues with them, so I switched to GPT-J 6B; no luck so far. Any help is greatly appreciated.
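If the NCCL communication problems show up again, enabling NCCL's own logging before launch should surface them (a generic sketch, not output from this run):

```sh
# Print NCCL's initialization and transport decisions in the training logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: restrict to init + networking

deepspeed --master_port 29501 main.py ...   # same command as above
```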

tiandongtao commented 1 year ago

same problem

tiandongtao commented 1 year ago

My problem has been solved. I was running the launch script on both nodes, but a DeepSpeed multi-node training job only needs to be launched once, similar to Horovod.
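In other words, launch from one node only and let DeepSpeed start the workers on the other node itself. A minimal sketch assuming the standard ssh-based launcher with a hostfile (hostnames are placeholders):

```sh
# /job/hostfile -- one line per node: address plus GPU slot count
#   node1 slots=8
#   node2 slots=8

# Run this once, on node1 only; the launcher spawns ranks on both nodes.
deepspeed --hostfile=/job/hostfile --master_port 29501 main.py ...
```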

hiteshis commented 1 year ago

@tiandongtao you are right. This does solve the "Address already in use" issue, but now I am running into another problem:

(screenshot of the new failure attached in the original issue)

The program exits with return code = -9. This is with GPT-J 6B. Were you able to run it end to end?
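Return code -9 corresponds to SIGKILL, which on Linux is most often the kernel's OOM killer reclaiming host memory; a quick way to check on each node (a generic sketch):

```sh
# Look for OOM-killer activity around the time of the crash.
dmesg -T | grep -i -E 'out of memory|killed process'

# Watch host RAM during the run; ZeRO stage 3 with two 6B models is heavy
# on host memory as well as GPU memory.
free -h
```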