hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
30.05k stars 3.7k forks

baichuan2-13b full-parameter training reports out of memory #1158

Closed FanWan closed 10 months ago

FanWan commented 10 months ago

Training environment: 4 × A800 80 GB GPUs

Launch script:

deepspeed --include localhost:4,5,6,7 --master_port $MASTER_PORT src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/work/record/llm_models/Baichuan2-13B-Chat \
    --do_train \
    --cutoff_len 1536 \
    --max_length 320 \
    --overwrite_output_dir \
    --dataset train_intent_args_all \
    --template baichuan2 \
    --finetuning_type full \
    --output_dir output/$SAVE_MODEL_PATH \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 400 \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --deepspeed dsconfig.json \
    --bf16 > log/train${SAVE_MODEL_PATH}.log 2>&1 &

Error log: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.94 GiB (GPU 1; 79.35 GiB total capacity; 77.65 GiB already allocated; 316.12 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

hiyouga commented 10 months ago

ZeRO-2 needs 8 × A100; with 4 cards you have to enable ZeRO-3.
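For reference, a minimal ZeRO-3 section for a DeepSpeed config file such as the `dsconfig.json` above might look like the sketch below. The issue never shows the actual file, so these field values are assumptions; the `"auto"` values rely on the Hugging Face Trainer integration filling them in from the launch arguments.

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Stage 3 partitions parameters, gradients, and optimizer states across all ranks, which is what brings the per-GPU footprint down enough for 4 × 80 GB cards.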

askcs517 commented 4 months ago

> ZeRO-2 needs 8 × A100; with 4 cards you have to enable ZeRO-3.

I'm doing full-parameter fine-tuning with ZeRO-3 on a single machine with 8 GPUs: 8 × A100 (40 GB) also hits OOM. Is that 240 G figure a theoretical value? Is there other memory overhead? I don't want to use ZeRO-3 offload, and I've already lowered the various parameters with no effect. Is there any other way? @hiyouga

hiyouga commented 4 months ago

It's a theoretical estimate.
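A theoretical estimate of this kind can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the common mixed-precision accounting (bf16 weights 2 B + bf16 gradients 2 B + fp32 Adam states 12 B ≈ 16 bytes per parameter); activations, buffers, and fragmentation come on top, which is why real runs can still OOM near the estimate.

```python
# Back-of-the-envelope memory estimate for full fine-tuning with Adam
# in mixed precision (sketch; excludes activations and framework overhead).
# Assumption: bf16 params (2 B) + bf16 grads (2 B) + fp32 master params
# and Adam m/v states (12 B) = 16 bytes per parameter.

def full_finetune_state_gib(n_params: float, bytes_per_param: int = 16) -> float:
    """Model + gradient + optimizer state in GiB."""
    return n_params * bytes_per_param / 1024**3

total = full_finetune_state_gib(13e9)  # Baichuan2-13B
print(f"total state: {total:.0f} GiB")                 # ~194 GiB
print(f"per GPU, ZeRO-3 over 8 GPUs: {total / 8:.0f} GiB")  # ~24 GiB
```

Under this accounting, 8 × A100 40 GB leaves only ~16 GiB per card for activations and everything else, so OOM without offload is plausible even with ZeRO-3.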
