hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Qwen1.5-MoE-A2.7B-Chat inference is very slow #3501

Closed: yecphaha closed this issue 6 months ago

yecphaha commented 6 months ago

Reminder

Reproduction

python src/api_demo.py \
    --model_name_or_path /Qwen1.5-MoE-A2.7B-Chat \
    --adapter_name_or_path /qwen1_5_moe_a2.7b_contract_200_sft_90 \
    --template qwen \
    --finetuning_type lora \
    --max_new_tokens 28672
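For reference, here is a minimal sketch of a client for the launched server, assuming api_demo.py exposes an OpenAI-compatible chat endpoint on the default host and port (the URL, model field, and prompt below are assumptions, not taken from the issue):

```python
# Minimal client sketch, assuming api_demo.py serves an OpenAI-compatible
# /v1/chat/completions endpoint on localhost:8000 (assumed defaults).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default host/port
    json={
        "model": "Qwen1.5-MoE-A2.7B-Chat",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello, please introduce yourself."}],
        "max_tokens": 512,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```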

Inference environment: Python=3.8.18, CUDA=12.2, a single A100 80G GPU

torch==2.1.2 transformers==4.41.0.dev0 peft==0.10.0 accelerate==0.28.0 gradio==3.48.0 trl==0.8.6 datasets==2.15.0

Expected behavior

Inference with Qwen1.5-MoE-A2.7B-Chat takes more than 4x as long as Qwen1.5-7B-Chat. How can this be optimized?
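A rough way to quantify the gap is to time plain transformers generation for both models under identical settings. The sketch below assumes the model paths from the command above plus a local Qwen1.5-7B-Chat checkpoint; paths and the prompt are placeholders:

```python
# Rough latency comparison sketch; model paths and prompt are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_generation(model_path: str, prompt: str, max_new_tokens: int = 256) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    del model
    torch.cuda.empty_cache()  # release GPU memory before loading the next model
    return elapsed

for path in ["/Qwen1.5-7B-Chat", "/Qwen1.5-MoE-A2.7B-Chat"]:
    print(path, f"{time_generation(path, 'Hello, please introduce yourself.'):.1f}s")
```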

System Info

CentOS 7.6

Others

No response

hiyouga commented 6 months ago

This is likely due to the limited efficiency of the official MoE implementation; there is currently no way to fix it on our side.
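For context, the Hugging Face sparse-MoE blocks (including Qwen2MoE's) route tokens by looping over every expert in Python and gathering the tokens assigned to each one, so each decoding step launches many small kernels and leaves the GPU underutilized. The following is a simplified sketch of that routing pattern, not the actual library code (it omits Qwen2MoE's shared expert and normalization details):

```python
# Simplified sketch of the per-expert routing loop used by Hugging Face
# sparse-MoE blocks (e.g. Qwen2MoE); not the actual library code.
import torch

def moe_forward(hidden, router, experts, top_k=4):
    # hidden: (num_tokens, dim); router: nn.Linear(dim, num_experts)
    logits = router(hidden)
    weights, selected = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    out = torch.zeros_like(hidden)
    # A Python loop over all experts: each iteration launches its own
    # kernels even when only a handful of tokens hit that expert, which
    # is the main source of the slowdown at batch size 1.
    for idx, expert in enumerate(experts):
        token_ids, slot = torch.where(selected == idx)
        if token_ids.numel() == 0:
            continue
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(hidden[token_ids])
    return out
```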