microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

step1-sft use lora failed #299

Closed bytes-lost closed 1 year ago

bytes-lost commented 1 year ago

env

gpu: 4*A100 80G
pytorch: 1.13.1
cuda version: 11.7
deepspeed: 0.9.0
transformers: 4.28.0.dev

run script

OUTPUT=$1
ZERO_STAGE=3
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

deepspeed main.py \
   --data_path path/to/local/data \
   --model_name_or_path path/to/codegen-16B-multi \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 2048 \
   --learning_rate 1e-4 \
   --weight_decay 0.1 \
   --num_train_epochs 5  \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --only_optimize_lora \
   --zero_stage $ZERO_STAGE \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --deepspeed \
   --output_dir $OUTPUT

error message

Traceback (most recent call last):
  File "/mnt/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in <module>
    main()
  File "/mnt/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 273, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 328, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
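The traceback hints at the root cause: with --only_optimize_lora, every non-LoRA parameter is frozen, and if --lora_module_name matches no module in the model, no LoRA layers are injected at all. The optimizer is then built from an empty parameter list, and ZeRO stage 3 fails when it reads param_groups[0]['params'][0]. A minimal sketch of that failure mode (assuming, as DeepSpeed-Chat does, that the optimizer is built only from parameters with requires_grad=True):

```python
import torch

# Stand-in for a model where no LoRA layers were injected:
# --only_optimize_lora freezes everything that is left.
model = torch.nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False

# Only trainable params go to the optimizer -> the group is empty.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW([{"params": trainable}])

try:
    # Same access pattern as deepspeed/runtime/zero/stage3.py line 133
    dtype = optimizer.param_groups[0]["params"][0].dtype
except IndexError:
    print("IndexError: list index out of range")
```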
asurdo commented 1 year ago

Same error.

gpu: 8*A100 40G
pytorch: 2.0.0
cuda version: 11.7
deepspeed: 0.9.0+0b5252b
transformers: 4.28.0.dev
deepspeed main.py \
   --data_path BelleGroup/train_1M_CN \
   --model_name_or_path gpt-neox-20b/ \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 2  \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --lora_dim 128 \
   --only_optimize_lora \
   --zero_stage 3 \
   --deepspeed \
   --output_dir $OUTPUT_PATH
zhengyanzhao1997 commented 1 year ago

Same error here.

puyuanOT commented 1 year ago

I got the same error

puyuanOT commented 1 year ago

The script worked for OPT models but does not work for other models. I guess it has something to do with the model format.

puyuanOT commented 1 year ago

I found the solution. Basically you have to change --lora_module_name decoder.layers. to the appropriate name for your model, for example, --lora_module_name h. for bloom and gpt-neo.
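This works because DeepSpeed-Chat's LoRA injection selects modules by a simple substring test on each module's qualified name, so a prefix that fits OPT (decoder.layers.) matches nothing in a bloom/gpt-neo-style model. A toy sketch of that matching (the classes below are illustrative stand-ins, not the real model definitions):

```python
import torch.nn as nn

# Toy blocks whose names mimic an OPT-style stack ("decoder.layers.N...")
# versus a bloom/gpt-neo-style stack ("h.N..."); names are illustrative only.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

class ToyOPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = nn.Module()
        self.decoder.layers = nn.ModuleList([Block() for _ in range(2)])

class ToyNeo(nn.Module):
    def __init__(self):
        super().__init__()
        self.h = nn.ModuleList([Block() for _ in range(2)])

def count_matches(model, part_module_name):
    # Same kind of substring test the LoRA conversion helper applies
    return sum(1 for name, m in model.named_modules()
               if part_module_name in name and isinstance(m, nn.Linear))

print(count_matches(ToyOPT(), "decoder.layers."))  # linear layers matched
print(count_matches(ToyNeo(), "decoder.layers."))  # 0 -> nothing trainable
print(count_matches(ToyNeo(), "h."))               # linear layers matched
```

Zero matches means no LoRA layers are injected, which combined with --only_optimize_lora leaves the optimizer with an empty parameter list.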

alibabadoufu commented 1 year ago

> I found the solution. Basically you have to change --lora_module_name decoder.layers. to the appropriate name for your model, for example, --lora_module_name h. for bloom and gpt-neo.

Thanks for your suggestion! Do you know what the lora_module_name is for the LLaMA model?

yaozhewei commented 1 year ago

Thank you @puyuanOT :). Yes, the LoRA replacement is based on the model arch (or the name)

ustc-lishuai commented 1 year ago

> I found the solution. Basically you have to change --lora_module_name decoder.layers. to the appropriate name for your model, for example, --lora_module_name h. for bloom and gpt-neo.
>
> Thanks for your suggestion! Do you know what the lora_module_name is for the LLaMA model?

You can run this code:

from transformers import AutoModel

model = AutoModel.from_pretrained("llama-7b-zpn")
for name, module in model.named_modules():
    print(name)

You will see the module names start with layers., so --lora_module_name layers. works.

alibabadoufu commented 1 year ago

> I found the solution. Basically you have to change --lora_module_name decoder.layers. to the appropriate name for your model, for example, --lora_module_name h. for bloom and gpt-neo.
>
> Thanks for your suggestion! Do you know what the lora_module_name is for the LLaMA model?
>
> You can run this code: from transformers import AutoModel; model = AutoModel.from_pretrained("llama-7b-zpn"); for name, module in model.named_modules(): print(name) -- the module names start with layers., so --lora_module_name layers. works.

Yup, that's what I did. It runs now. Thanks! :D

shyoulala commented 1 year ago

nice