microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Four 80GB A100s apparently cannot train a LoRA-based 7B BLOOM with batch size 4, while Colossal-AI can. I'm confused; in my comparison, the batch size can only be set to 1 #3361

Open NostalgiaOfTime opened 1 year ago

NostalgiaOfTime commented 1 year ago

Describe the bug
Four 80GB A100s apparently cannot train a LoRA-based 7B BLOOM with batch size 4, while Colossal-AI can. I'm confused; in my comparison, the batch size can only be set to 1.

To Reproduce
Below is the script I slightly modified to adapt to BLOOM (the official release only provides a script adapted for Facebook's OPT). The official docs state that gradient_checkpointing conflicts with only_optimize_lora, so I only used only_optimize_lora.

OUTPUT_PATH=/mnt/bn/simple-nas/mlx/users/zhangyawei.ywsq/playground/arnold_ywsq/DeepSpeedExamples/applications/DeepSpeed-Chat/save/actor-models/7b1_bloom_lora
mkdir -p $OUTPUT_PATH

deepspeed --master_port 25104 --num_gpus 4 main.py \
   --data_path xxx \
   --data_split 10,0,0 \
   --model_name_or_path xxx \
   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 2048 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 3 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 128 \
   --lora_module_name transformer.h. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
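For context on why batch size 4 at max_seq_len 2048 can exhaust 80 GB without activation checkpointing, here is a back-of-the-envelope estimate. This is only a sketch: the shape constants are taken from the published bigscience/bloom-7b1 config, and the per-layer activation accounting is deliberately simplified.

```python
# Rough memory estimate for the command above (fp16, no activation
# checkpointing, ZeRO stage 0 so every GPU holds a full model copy).
# Shape constants are assumptions based on the bigscience/bloom-7b1 config.
hidden_size = 4096
num_layers = 30
num_heads = 32
batch = 4        # --per_device_train_batch_size 4
seq_len = 2048   # --max_seq_len 2048
bytes_fp16 = 2

# Attention score matrices alone: batch * heads * seq * seq per layer.
attn_scores = batch * num_heads * seq_len * seq_len * bytes_fp16
# Hidden-state activations kept for backward (very rough: ~10 tensors of
# size batch * seq * hidden per layer for QKV, MLP intermediates, etc.).
hidden_acts = 10 * batch * seq_len * hidden_size * bytes_fp16

acts_gb = (attn_scores + hidden_acts) * num_layers / 1024**3
weights_gb = 7.1e9 * bytes_fp16 / 1024**3
# Well over half of an 80 GB card before temporaries and fragmentation.
print(f"activations ~{acts_gb:.0f} GB + fp16 weights ~{weights_gb:.0f} GB per GPU")
```

Under these rough assumptions, activations dominate the footprint, which is consistent with LoRA alone not being enough to fit batch size 4 when checkpointing is disabled.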

ScottishFold007 commented 1 year ago

Try bfloat16 and gradient_checkpointing=True.
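Outside of the DeepSpeed-Chat launcher, these two suggestions map to a standard Hugging Face load call plus the usual DeepSpeed bf16 config keys, roughly as below. This is a minimal sketch, not the script's actual options; the model name and micro-batch size mirror the command above.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16 instead of the default float16.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.bfloat16
)
# Recompute activations during backward instead of storing them all.
model.gradient_checkpointing_enable()

# Minimal DeepSpeed config enabling bf16 (standard config keys).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
```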

NostalgiaOfTime commented 1 year ago

@ScottishFold007 From what I see in the source code, float16 is used by default, and gradient_checkpointing and only_optimize_lora cannot be used at the same time; with the officially released code, if you want LoRA you have to give up gradient_checkpointing. In principle it shouldn't take up this much GPU memory, since with LoRA only the two low-rank matrices in each layer need backprop, and the number of parameters actually being optimized is very small.
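For what it's worth, the usual reason checkpointing and LoRA clash is that the frozen base weights don't require grad, so a checkpointed block may see no grad-requiring input and backward through it breaks. A commonly used workaround outside the DeepSpeed-Chat scripts is to force the embedding output to require grad before enabling checkpointing; the sketch below assumes transformers' standard enable_input_require_grads helper and also shows how to confirm that the LoRA-trainable fraction is indeed tiny.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.float16
)

# ... attach LoRA adapters and freeze the base weights here ...

# Workaround often used when combining LoRA with activation checkpointing:
# make the embedding output require grad so checkpointed blocks still get
# a grad-requiring input, then enable checkpointing.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()

# Confirm how few parameters are actually optimized once LoRA is attached.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable/1e6:.1f}M / total: {total/1e9:.1f}B "
      f"({100 * trainable / total:.2f}%)")
```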

AnnieHu1006 commented 1 year ago

Hello, could you share the package configuration of the Docker image you used when you got DeepSpeed-Chat running? On my side I keep hitting errors that look like version conflicts between the installed packages. Thanks.