hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
30.05k stars 3.7k forks

baichuan2-13b full-parameter training reports out of memory #1158

Closed FanWan closed 10 months ago

FanWan commented 10 months ago

Training environment: 4 × A800 80 GB GPUs

Launch script:

deepspeed --include localhost:4,5,6,7 --master_port $MASTER_PORT src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/work/record/llm_models/Baichuan2-13B-Chat \
    --do_train \
    --cutoff_len 1536 \
    --max_length 320 \
    --overwrite_output_dir \
    --dataset train_intent_args_all \
    --template baichuan2 \
    --finetuning_type full \
    --output_dir output/$SAVE_MODEL_PATH \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 400 \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --deepspeed dsconfig.json \
    --bf16 > log/train${SAVE_MODEL_PATH}.log 2>&1 &

Error log: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.94 GiB (GPU 1; 79.35 GiB total capacity; 77.65 GiB already allocated; 316.12 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

hiyouga commented 10 months ago

ZeRO-2 needs 8 × A100; with 4 cards you have to enable ZeRO-3.
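For reference, a minimal ZeRO-3 section for a DeepSpeed config file such as the `dsconfig.json` above might look like the sketch below. The issue never shows the actual file, so these field values are assumptions; the `"auto"` values rely on the Hugging Face Trainer integration filling them in from the launch arguments.

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Stage 3 partitions parameters, gradients, and optimizer states across all ranks, which is what brings the per-GPU footprint down enough for 4 × 80 GB cards.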

askcs517 commented 4 months ago

> ZeRO-2 needs 8 × A100; with 4 cards you have to enable ZeRO-3.

I'm doing full-parameter fine-tuning with ZeRO-3 on a single machine with 8 GPUs: 8 × A100 (40 GB) also hits OOM. Is that 240 G figure a theoretical value? Is there other memory overhead? I don't want to use ZeRO-3 offload, and I've already lowered the various parameters with no effect. Is there any other way? @hiyouga

hiyouga commented 4 months ago

It's a theoretical estimate.
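A theoretical estimate of this kind can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the common mixed-precision accounting (bf16 weights 2 B + bf16 gradients 2 B + fp32 Adam states 12 B ≈ 16 bytes per parameter); activations, buffers, and fragmentation come on top, which is why real runs can still OOM near the estimate.

```python
# Back-of-the-envelope memory estimate for full fine-tuning with Adam
# in mixed precision (sketch; excludes activations and framework overhead).
# Assumption: bf16 params (2 B) + bf16 grads (2 B) + fp32 master params
# and Adam m/v states (12 B) = 16 bytes per parameter.

def full_finetune_state_gib(n_params: float, bytes_per_param: int = 16) -> float:
    """Model + gradient + optimizer state in GiB."""
    return n_params * bytes_per_param / 1024**3

total = full_finetune_state_gib(13e9)  # Baichuan2-13B
print(f"total state: {total:.0f} GiB")                 # ~194 GiB
print(f"per GPU, ZeRO-3 over 8 GPUs: {total / 8:.0f} GiB")  # ~24 GiB
```

Under this accounting, 8 × A100 40 GB leaves only ~16 GiB per card for activations and everything else, so OOM without offload is plausible even with ZeRO-3.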
