microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Multi-node failure with Step 3 RLHF training with GPT-J 6B on 2x8x32GB V100 #3672

Open hiteshis opened 1 year ago

hiteshis commented 1 year ago

Describe the bug
I am not able to run the multi-node Step 3 script with a 6B actor and a 6B critic on 2 nodes of 8 V100 GPUs each on Azure ML. I am running the following command:

```
deepspeed --master_port 29501 main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path EleutherAI/gpt-j-6b \
    --critic_model_name_or_path /mnt/data/ds-chat-step2output/gptj6b \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --generation_batch_numbers 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 50 \
    --max_prompt_seq_len 256 \
    --actor_learning_rate 5e-4 \
    --critic_learning_rate 5e-6 \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --num_warmup_steps 100 \
    --deepspeed_mpi \
    --deepspeed \
    --seed 1234 \
    --enable_hybrid_engine \
    --inference_tp_size 8 \
    --tp_gather_partition_size 4 \
    --actor_zero_stage 3 \
    --critic_zero_stage 3 \
    --actor_gradient_checkpointing \
    --disable_actor_dropout \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir /mnt/data/ds-chat-step3output
```

Log output:

```
RuntimeError: The server socket has failed to listen on any local network address.
The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use).
The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
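To confirm what is holding the port, checking the listeners on each node before launch is a quick sanity check (a generic sketch; any free port works for `--master_port`):

```sh
# Show whatever is already listening on the master port (29501 here).
ss -ltnp | grep 29501

# If the port is taken, either stop that process or launch with a free port:
deepspeed --master_port 29502 main.py ...
```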

To Reproduce Steps to reproduce the behavior:

  1. Install requirements + mpi4py (see the sketch below)
  2. Run the command above
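A minimal sketch of step 1, assuming the DeepSpeed-Chat Step 3 `requirements.txt` (mpi4py installed separately, since the command above passes `--deepspeed_mpi`):

```sh
pip install -r requirements.txt   # DeepSpeed-Chat step 3 requirements
pip install mpi4py                # MPI bindings for the --deepspeed_mpi path
```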

Expected behavior
The Step 3 training run completes end to end.

Screenshots

(two screenshots of the failure were attached to the original issue)

System info (please complete the following information):

  - Setup: 2 nodes, each with 8x 32GB V100 GPUs, running on Azure ML

Earlier, I was using OPT models, but someone reported NCCL communication issues with them, so I switched to GPT-J 6B; no luck so far. Any help is greatly appreciated.
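If the NCCL communication problems show up again, enabling NCCL's own logging before launch should surface them (a generic sketch, not output from this run):

```sh
# Print NCCL's initialization and transport decisions in the training logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: restrict to init + networking

deepspeed --master_port 29501 main.py ...   # same command as above
```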

tiandongtao commented 1 year ago

same problem

tiandongtao commented 1 year ago

My problem has been solved. I was running the launch script on both nodes, but a DeepSpeed multi-node training job only needs to be launched once, similar to Horovod.
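In other words, launch from one node only and let DeepSpeed start the workers on the other node itself. A minimal sketch assuming the standard ssh-based launcher with a hostfile (hostnames are placeholders):

```sh
# /job/hostfile -- one line per node: address plus GPU slot count
#   node1 slots=8
#   node2 slots=8

# Run this once, on node1 only; the launcher spawns ranks on both nodes.
deepspeed --hostfile=/job/hostfile --master_port 29501 main.py ...
```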

hiteshis commented 1 year ago

@tiandongtao you are right. This does solve the "Address already in use" issue, but now I am running into another problem:

(screenshot of the new failure attached in the original issue)

The program exits with return code = -9. This is with GPT-J 6B. Were you able to run it end to end?
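Return code -9 corresponds to SIGKILL, which on Linux is most often the kernel's OOM killer reclaiming host memory; a quick way to check on each node (a generic sketch):

```sh
# Look for OOM-killer activity around the time of the crash.
dmesg -T | grep -i -E 'out of memory|killed process'

# Watch host RAM during the run; ZeRO stage 3 with two 6B models is heavy
# on host memory as well as GPU memory.
free -h
```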