hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
30.79k stars 3.8k forks

Out of memory when fine-tuning chatglm3-6b on two 4090 GPUs with LoRA + ZeRO-3 — what am I doing wrong? The official docs say the recommended hardware memory for a 7B model is 16 GB #3094

Closed dongxu closed 5 months ago

dongxu commented 5 months ago

Reminder

Reproduction

Here is the command I ran:

```sh
deepspeed --num_gpus 2 src/train_bash.py \
    --deepspeed ds_z3_config.json \
    --ddp_timeout 180000000 \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/bill/windows/code/glm3-6B \
    --finetuning_type lora \
    --template chatglm3 \
    --dataset_dir data \
    --dataset kaimen \
    --cutoff_len 1024 \
    --learning_rate 0.0002 \
    --num_train_epochs 20.0 \
    --max_samples 500 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --output_dir saves/ChatGLM3-6B-Chat/lora/kaimen \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target all \
    --use_dora True \
    --plot_loss True
```

Expected behavior

No response

System Info

No response

Others

No response

dongxu commented 5 months ago

Here is the content of my DeepSpeed config file:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
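[Editor's note: not from this thread, but a standard DeepSpeed mitigation for ZeRO-3 OOM is CPU offload of optimizer state and parameters. A sketch of the extra keys one might add under `zero_optimization` (these are documented ZeRO-3 options, offered here only as a possible workaround, not as the confirmed fix):]

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Offloading trades GPU memory for host RAM and slower steps, so it only helps if the machine has enough system memory.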

hiyouga commented 5 months ago

Try switching to accelerate.
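[Editor's note: for readers unfamiliar with the suggestion, a minimal sketch of driving the same DeepSpeed config through accelerate. The file name `accel_ds.yaml` and the specific values are assumptions, not from the thread:]

```yaml
# hypothetical accel_ds.yaml — values are assumptions
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_processes: 2
mixed_precision: fp16
deepspeed_config:
  deepspeed_config_file: ds_z3_config.json
  zero3_init_flag: true
```

One would then launch with `accelerate launch --config_file accel_ds.yaml src/train_bash.py ...`, passing the same training flags as before.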

liushiyangstd commented 5 months ago

Does the 16 GB hardware memory requirement refer to GPU VRAM, or something else?
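[Editor's note: a rough back-of-envelope calculation, assuming a round 6B parameter count (ChatGLM3-6B's exact count differs slightly), shows why the fp16 weights alone get close to a 16 GB budget before gradients, activations, or optimizer state are counted:]

```python
# Rough memory estimate for a ~6B-parameter model in fp16.
# The 6B figure is a hypothetical round number for illustration.
params = 6_000_000_000

# fp16 stores 2 bytes per parameter.
fp16_weights_gb = params * 2 / 1024**3
print(f"fp16 weights alone: {fp16_weights_gb:.1f} GB")  # ~11.2 GB
```

ZeRO-3 shards those weights across the two GPUs, but activations at `--cutoff_len 1024`, gradients, and CUDA overhead come on top per card, which is why a 24 GB 4090 can still run out.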

pppppkun commented 5 months ago

I ran into the same problem (2× 4090, fine-tuning chatglm3-6b). It might be because the 4090 architecture does not support NVLink, so the GPUs cannot pool their memory.

hiyouga commented 5 months ago

I suggest trying Yi 6B or Qwen1.5.

xhjh commented 3 months ago

How was this issue resolved?