Closed · yecphaha closed this 6 months ago
This is likely because the official MoE implementation is not very efficient; there is currently no way to fix it on our side.
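For context on the maintainer's point: in the Hugging Face `transformers` implementation, a sparse MoE block routes each token to its top-k experts and then iterates over the experts in a Python-level loop, launching a small matmul plus gather/scatter per expert. The sketch below is a simplified NumPy illustration of that dispatch pattern (not the actual `transformers` code); the per-expert loop is where the overhead comes from relative to a single dense matmul, which is consistent with the MoE model being slower per token than a comparable dense model.

```python
import numpy as np

def moe_forward_looped(x, gate_w, expert_ws, top_k=4):
    """Naive sparse-MoE forward pass (illustrative only).

    x:         (tokens, dim) input activations
    gate_w:    (dim, n_experts) router weights
    expert_ws: list of (dim, dim) expert weight matrices
    """
    logits = x @ gate_w                                   # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]         # top-k expert ids per token
    sel = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))    # softmax over selected experts
    weights /= weights.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    # Python-level loop over experts: each iteration does its own
    # gather, small matmul, and scatter -- the dispatch overhead a
    # single dense layer avoids.
    for e, w_e in enumerate(expert_ws):
        token_idx, slot = np.nonzero(top == e)            # tokens routed to expert e
        if token_idx.size == 0:
            continue
        out[token_idx] += weights[token_idx, slot, None] * (x[token_idx] @ w_e)
    return out
```

With many experts and small per-expert batches (decoding generates one token at a time), most of the wall time goes into launch and routing overhead rather than useful FLOPs.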
Reminder
Reproduction
```shell
python src/api_demo.py \
    --model_name_or_path /Qwen1.5-MoE-A2.7B-Chat \
    --adapter_name_or_path /qwen1_5_moe_a2.7b_contract_200_sft_90 \
    --template qwen \
    --finetuning_type lora \
    --max_new_tokens 28672
```
Inference environment: Python=3.8.18, CUDA=12.2, single A100 80 GB GPU
torch==2.1.2 transformers==4.41.0.dev0 peft==0.10.0 accelerate==0.28.0 gradio==3.48.0 trl==0.8.6 datasets==2.15.0
Expected behavior
Inference with Qwen1.5-MoE-A2.7B-Chat takes more than four times as long as with Qwen1.5-7B-Chat. How can this be optimized?
System Info
Centos 7.6
Others
No response
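Not suggested in the thread, but one partial mitigation with this stack: merge the LoRA adapter into the base weights before serving, so inference skips the PEFT adapter hooks (this does not address the MoE expert-loop overhead itself). LLaMA-Factory versions of that era ship an export script for this; the flags below mirror the reproduction command, and the export directory path is illustrative.

```shell
python src/export_model.py \
    --model_name_or_path /Qwen1.5-MoE-A2.7B-Chat \
    --adapter_name_or_path /qwen1_5_moe_a2.7b_contract_200_sft_90 \
    --template qwen \
    --finetuning_type lora \
    --export_dir /qwen1_5_moe_merged
```

The merged checkpoint in `--export_dir` can then be passed directly to `api_demo.py` via `--model_name_or_path`, with no `--adapter_name_or_path`.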