accelerate does not reduce the per-GPU memory footprint. Use DeepSpeed ZeRO-3 to shard the model across multiple GPUs: https://github.com/xverse-ai/XVERSE-65B?tab=readme-ov-file#%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83
Each card is an RTX 4090 with 24 GB of VRAM, but it still reports out of memory.
Training script:
deepspeed --num_gpus 2 src/train_bash.py \
--deepspeed deep_speed.json \
--stage sft \
--do_train True \
--model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
--finetuning_type lora \
--template baichuan2 \
--dataset_dir data \
--dataset self_cognition \
--cutoff_len 1024 \
--learning_rate 5e-05 \
--num_train_epochs 3 \
--max_samples 100000 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--neftune_noise_alpha 0 \
--lora_rank 8 \
--lora_dropout 0.1 \
--lora_target W_pack \
--output_dir saves/Baichuan2-13B-Chat/lora/train_2023-12-22-17-54-04 \
--bf16 True \
--plot_loss True
Contents of deep_speed.json:
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": false
  },
  "bfloat16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
Error:
Parameter Offload: Total persistent parameters: 2053120 in 121 params
0%| | 0/6 [00:00<?, ?it/s]Traceback (most recent call last):
File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
outputs = model(**inputs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1073, in forward
return self.base_model(
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 103, in forward
return self.model.forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 705, in forward
logits = self.lm_head(hidden_states)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 513, in forward
norm_weight = nn.functional.normalize(self.weight)
File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/functional.py", line 4661, in normalize
return input / denom
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.20 GiB (GPU 1; 23.65 GiB total capacity; 17.21 GiB already allocated; 678.06 MiB free; 22.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-23 22:52:24,054] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 5303
[2023-12-23 22:52:24,471] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 5304
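As the error text itself suggests, one thing worth trying alongside ZeRO-3 is capping the CUDA caching allocator's split size via PYTORCH_CUDA_ALLOC_CONF; a minimal sketch (128 MB is an arbitrary example value, and fragmentation tuning alone may not be enough for a 13B model on 24 GB cards):
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 deepspeed --num_gpus 2 src/train_bash.py \
--deepspeed deep_speed.json \
... (remaining arguments as in the training script above)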
@hiyouga Is there something wrong with the deepspeed command? It still runs out of memory.
Reduce the batch size.
Solved by adding the following to deep_speed.json (it belongs inside the "zero_optimization" block; a sketch of the full file follows the snippet):
"offload_param": {
"device": "cpu",
"pin_memory": true
}
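For reference, here is a minimal sketch of the resulting deep_speed.json with offload_param nested under zero_optimization (same values as the config above; offloading parameters to CPU trades training speed for GPU memory):
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": false
  },
  "bfloat16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}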
Also, does model loading in train_web.py support multiple GPUs? A single GPU cannot hold the 13B model. @hiyouga
@leoterry-ulrica web_demo supports multi-GPU loading.
Which parameter should be added to the launch command so that it loads the 13B model across multiple GPUs and avoids OOM on a single card? The current launch command is below; does it need to be changed?
python src/web_demo.py \
--model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
--adapter_name_or_path saves/Baichuan2-13B-Chat/lora/train_2023-12-24-09-54-04 \
--template baichuan2 \
--finetuning_type lora
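For context, multi-GPU model loading in plain transformers is typically done with device_map="auto" (requires the accelerate package), which shards the layers across all visible GPUs. A hypothetical minimal sketch, not necessarily what web_demo.py does internally:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/root/autodl-tmp/Baichuan2-13B-Chat"  # same path as in the thread

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # half-precision weights
    device_map="auto",           # shard the model across all visible GPUs
    trust_remote_code=True,      # Baichuan2 ships custom modeling code
)
The LoRA adapter from --adapter_name_or_path could then be attached on top, e.g. with peft's PeftModel.from_pretrained.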
@hiyouga How do I set up multi-GPU for web_demo? Simply setting CUDA_VISIBLE_DEVICES=4,5,6,7 still OOMs when loading Qwen1.5-72B-Chat on 4×80GB; judging from GPU utilization, only GPU 4 is used. I could not find a matching example in the issues. With vLLM, 2×80GB is enough to load the same model at 4096 context length. My code version is roughly from 2024-03-04. Launch script:
PYTHONPATH=***/LLaMA-Factory CUDA_VISIBLE_DEVICES=4,5,6,7 WEB_PORT=6072 nohup python3 ./src/web_demo.py \
--model_name_or_path ***/pretrain_model/Qwen1.5-72B-Chat \
--template qwen \
--finetuning_type full \
--repetition_penalty 1.2 \
--cutoff_len 8192 > qwen1.5-72B-web.out &
@luoqishuai Please update the code to the latest version.
The issue is fixed in the new code, thanks a lot!