microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Four 80GB A100s apparently cannot train a LoRA-based 7B BLOOM with batch size 4, while Colossal-AI can. I'm confused; in my comparison, the batch size can only be set to 1 #3361

Open NostalgiaOfTime opened 1 year ago

NostalgiaOfTime commented 1 year ago

Describe the bug
Four 80GB A100s apparently cannot train a LoRA-based 7B BLOOM with batch size 4, while Colossal-AI can. I'm confused; in my comparison, the batch size can only be set to 1.

To Reproduce
Below is the script I slightly modified to adapt to BLOOM (the official release only provides a script adapted for Facebook's OPT). The official docs state that gradient_checkpointing conflicts with only_optimize_lora, so I only used only_optimize_lora.

OUTPUT_PATH=/mnt/bn/simple-nas/mlx/users/zhangyawei.ywsq/playground/arnold_ywsq/DeepSpeedExamples/applications/DeepSpeed-Chat/save/actor-models/7b1_bloom_lora
mkdir -p $OUTPUT_PATH

deepspeed --master_port 25104 --num_gpus 4 main.py \
   --data_path xxx \
   --data_split 10,0,0 \
   --model_name_or_path xxx \
   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 2048 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 3 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 128 \
   --lora_module_name transformer.h. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
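For context on why batch size 4 at max_seq_len 2048 can exhaust 80 GB without activation checkpointing, here is a back-of-the-envelope estimate. This is only a sketch: the shape constants are taken from the published bigscience/bloom-7b1 config, and the per-layer activation accounting is deliberately simplified.

```python
# Rough memory estimate for the command above (fp16, no activation
# checkpointing, ZeRO stage 0 so every GPU holds a full model copy).
# Shape constants are assumptions based on the bigscience/bloom-7b1 config.
hidden_size = 4096
num_layers = 30
num_heads = 32
batch = 4        # --per_device_train_batch_size 4
seq_len = 2048   # --max_seq_len 2048
bytes_fp16 = 2

# Attention score matrices alone: batch * heads * seq * seq per layer.
attn_scores = batch * num_heads * seq_len * seq_len * bytes_fp16
# Hidden-state activations kept for backward (very rough: ~10 tensors of
# size batch * seq * hidden per layer for QKV, MLP intermediates, etc.).
hidden_acts = 10 * batch * seq_len * hidden_size * bytes_fp16

acts_gb = (attn_scores + hidden_acts) * num_layers / 1024**3
weights_gb = 7.1e9 * bytes_fp16 / 1024**3
# Well over half of an 80 GB card before temporaries and fragmentation.
print(f"activations ~{acts_gb:.0f} GB + fp16 weights ~{weights_gb:.0f} GB per GPU")
```

Under these rough assumptions, activations dominate the footprint, which is consistent with LoRA alone not being enough to fit batch size 4 when checkpointing is disabled.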

ScottishFold007 commented 1 year ago

Try bfloat16 and gradient_checkpointing=True.
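Outside of the DeepSpeed-Chat launcher, these two suggestions map to a standard Hugging Face load call plus the usual DeepSpeed bf16 config keys, roughly as below. This is a minimal sketch, not the script's actual options; the model name and micro-batch size mirror the command above.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16 instead of the default float16.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.bfloat16
)
# Recompute activations during backward instead of storing them all.
model.gradient_checkpointing_enable()

# Minimal DeepSpeed config enabling bf16 (standard config keys).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
```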

NostalgiaOfTime commented 1 year ago

@ScottishFold007 From what I see in the source code, float16 is used by default, and gradient_checkpointing and only_optimize_lora cannot be used at the same time; with the officially released code, if you want LoRA you have to give up gradient_checkpointing. In principle it shouldn't take up this much GPU memory, since with LoRA only the two low-rank matrices in each layer need backprop, and the number of parameters actually being optimized is very small.
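For what it's worth, the usual reason checkpointing and LoRA clash is that the frozen base weights don't require grad, so a checkpointed block may see no grad-requiring input and backward through it breaks. A commonly used workaround outside the DeepSpeed-Chat scripts is to force the embedding output to require grad before enabling checkpointing; the sketch below assumes transformers' standard enable_input_require_grads helper and also shows how to confirm that the LoRA-trainable fraction is indeed tiny.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.float16
)

# ... attach LoRA adapters and freeze the base weights here ...

# Workaround often used when combining LoRA with activation checkpointing:
# make the embedding output require grad so checkpointed blocks still get
# a grad-requiring input, then enable checkpointing.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()

# Confirm how few parameters are actually optimized once LoRA is attached.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable/1e6:.1f}M / total: {total/1e9:.1f}B "
      f"({100 * trainable / total:.2f}%)")
```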

AnnieHu1006 commented 1 year ago

Hello, could you share the package configuration of the Docker image you used when you got DeepSpeed-Chat running? On my side I keep hitting errors that look like version conflicts between the installed packages. Thanks.