hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Qwen-72B-Chat and XVERSE-65B-Chat cannot be LoRA (rank=8) fine-tuned on 8x A800 GPUs #2520

Closed: angel1288 closed this issue 8 months ago

angel1288 commented 8 months ago

Reminder

Reproduction

A quick question: with the accelerate-based setup, LoRA (rank=8) fine-tuning of qwen-72b-chat and XVERSE-65B-chat both fail to start and keep running out of memory (OOM). Training precision is fp16; the int4-quantized setup runs fine on 4 GPUs, but the non-quantized one does not. Is the hardware simply insufficient, or is some extra configuration needed? Thanks.

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 8 months ago

Use DeepSpeed ZeRO-3 + offload.
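
For reference, a minimal sketch of what a ZeRO-3 + CPU offload file could look like, assembled from standard DeepSpeed options rather than copied from this repository; the offload_optimizer / offload_param blocks and the fp16 section are illustrative assumptions, and the "auto" values are resolved by the launcher at runtime:

    {
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto",
      "zero_allow_untested_optimizer": true,
      "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
        },
        "offload_param": {
          "device": "cpu",
          "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": true
      }
    }

Offloading optimizer state and parameters to CPU RAM trades step time for GPU memory, which is typically what allows a 65B-72B model to fit on 8 GPUs for non-quantized LoRA fine-tuning.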

angel1288 commented 8 months ago

@hiyouga Hi, with the following config it now runs, but the loss is 0 right from the start:

    {
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto",
      "zero_allow_untested_optimizer": true,
      "bf16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
      }
    }

hiyouga commented 8 months ago

Use the config provided in https://github.com/xverse-ai/XVERSE-65B.

angel1288 commented 8 months ago

> Use the config provided in https://github.com/xverse-ai/XVERSE-65B.

Hi, with that config the loss is still 0. I am currently on llmtuner==0.4.0; I will try upgrading to llmtuner==0.5.2.

angel1288 commented 8 months ago

@hiyouga After upgrading the framework to llmtuner==0.5.2, training also works fine with the original config.