microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
34.82k stars 4.05k forks

[BUG] exits with return code = -9 #4181

Open liyifo opened 1 year ago

liyifo commented 1 year ago

Describe the bug: I can train on a single 3090, but with two 3090s the run exits with return code -9 and no error message.
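For reference, a negative return code from the DeepSpeed launcher follows Python's subprocess convention: the worker was killed by that signal number. -9 is SIGKILL, which on Linux is most often the kernel OOM killer rather than a Python error, which would explain the absence of a traceback. A minimal sketch to decode the code:

```python
import signal

# subprocess convention: a return code of -N means the process
# was terminated by signal N (Linux/POSIX).
rc = -9  # the return code reported by the launcher
sig = signal.Signals(-rc)
print(sig.name)  # SIGKILL -- often the kernel OOM killer
```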

ds_report output: (screenshot attached)

Screenshots: (screenshot attached)

System info (please complete the following information):

Launcher context

deepspeed --num_gpus=2 fastchat/train/train_lora.py \
    --model_name_or_path ../vicuna-13b-v1.5 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ../4k_gossip_real_train.json \
    --bf16 True \
    --output_dir ../vicuna-13b-gossip-output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 4100 \
    --tf32 True \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \
    --gradient_checkpointing True

Docker context: I'm running inside an LXC container.
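Since the crash only appears with multiple GPUs inside a container, one thing worth checking (an assumption, not confirmed by the logs above) is the shared-memory mount: NCCL's inter-process communication uses /dev/shm, and containers often mount a small one, which can kill workers without a Python traceback. A quick sketch to inspect it:

```python
import os
import shutil

# NCCL uses POSIX shared memory under /dev/shm for inter-process
# communication; a small container shm mount is a common multi-GPU pitfall.
path = "/dev/shm" if os.path.isdir("/dev/shm") else "/tmp"  # fallback for non-Linux
total, used, free = shutil.disk_usage(path)
print(f"{path}: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")
```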

mrwyattii commented 1 year ago

Hi @liyifo could you please try running a simple script to verify that the DeepSpeed launcher can work with multi-GPU for your setup?

Run the following and report back if you see an error:

# Run with `deepspeed --num_gpus 2 hello-world.py`
import os
import deepspeed
deepspeed.init_distributed()
local_rank = os.getenv("LOCAL_RANK")
world_size = os.getenv("WORLD_SIZE")
print(local_rank, world_size)

Also, since this fastchat script is using transformers, could you share the version of that library you are using? pip list | grep transformers

liyifo commented 1 year ago

> Hi @liyifo could you please try running a simple script to verify that the DeepSpeed launcher can work with multi-GPU for your setup?
>
> Run the following and report back if you see an error:
>
> # Run with `deepspeed --num_gpus 2 hello-world.py`
> import os
> import deepspeed
> deepspeed.init_distributed()
> local_rank = os.getenv("LOCAL_RANK")
> world_size = os.getenv("WORLD_SIZE")
> print(local_rank, world_size)
>
> Also, since this fastchat script is using transformers, could you share the version of that library you are using? pip list | grep transformers

transformers 4.31.0 (screenshot attached)

mrwyattii commented 1 year ago

@liyifo can you please install the missing module? pip install mpi4py

That might fix the original error you were seeing
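A quick way to confirm the module is importable after installing (a sanity-check sketch, not part of the DeepSpeed API):

```python
# Sanity check that mpi4py imports cleanly after `pip install mpi4py`.
try:
    from mpi4py import MPI
    print("mpi4py OK, MPI standard version:", MPI.Get_version())
except ImportError as err:
    print("mpi4py still missing:", err)
```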

liyifo commented 1 year ago

> @liyifo can you please install the missing module? pip install mpi4py
>
> That might fix the original error you were seeing

Sorry, earlier I was running the script directly from the command line; here is the result of launching the file with deepspeed: (screenshot attached)

loadams commented 1 year ago

@liyifo - that looks to have completed successfully, are you still seeing errors with your original script? Or can you post the new error from that?