Open liyifo opened 1 year ago
Hi @liyifo, could you please try running a simple script to verify that the DeepSpeed launcher works with multiple GPUs on your setup?
Run the following and report back if you see an error:
# Run with `deepspeed --num_gpus 2 hello-world.py`
import os
import deepspeed
deepspeed.init_distributed()
local_rank = os.getenv("LOCAL_RANK")
world_size = os.getenv("WORLD_SIZE")
print(local_rank, world_size)
Also, since this fastchat script is using transformers, could you share the version of that library you are using? pip list | grep transformers
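If pip list is slow or ambiguous inside the container, a standard-library alternative (a sketch; it just queries installed package metadata for transformers and deepspeed) is:

```python
# Print installed versions without importing the heavy packages themselves.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "deepspeed"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```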
transformers 4.31.0
@liyifo can you please install the missing module?
pip install mpi4py
That might fix the original error you were seeing.
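Before rerunning, you can confirm the module is visible to your Python environment (a minimal check; it only verifies importability, not that an MPI runtime is configured):

```python
# Check that mpi4py can be located by this Python environment.
import importlib.util

spec = importlib.util.find_spec("mpi4py")
print("mpi4py found:", spec is not None)
```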
Sorry, my earlier output was from running it on the command line; this is the result of running the file with the deepspeed launcher instead.
@liyifo - that looks to have completed successfully, are you still seeing errors with your original script? Or can you post the new error from that?
Describe the bug
I can train with a single 3090, but with two 3090s training fails without any useful error message.
ds_report output
Screenshots
System info (please complete the following information):
Launcher context
deepspeed --num_gpus=2 fastchat/train/train_lora.py \
    --model_name_or_path ../vicuna-13b-v1.5 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ../4k_gossip_real_train.json \
    --bf16 True \
    --output_dir ../vicuna-13b-gossip-output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 4100 \
    --tf32 True \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \
    --gradient_checkpointing True
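The contents of playground/deepspeed_config_s2.json are not shown here; for reference, a minimal ZeRO stage-2 configuration consistent with the flags above (bf16 enabled, batch sizes deferred to the trainer via "auto") might look like this sketch:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The actual file may differ; posting it would help narrow down whether the multi-GPU failure is config-related.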
Docker context I'm in an lxc container.