hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Qwen1.5-MoE-A2.7B-Chat inference is very slow #3501

Closed: yecphaha closed this issue 6 months ago

yecphaha commented 6 months ago

Reminder

Reproduction

python src/api_demo.py \
    --model_name_or_path /Qwen1.5-MoE-A2.7B-Chat \
    --adapter_name_or_path /qwen1_5_moe_a2.7b_contract_200_sft_90 \
    --template qwen \
    --finetuning_type lora \
    --max_new_tokens 28672
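For reference, here is a minimal sketch of a client for the launched server, assuming api_demo.py exposes an OpenAI-compatible chat endpoint on the default host and port (the URL, model field, and prompt below are assumptions, not taken from the issue):

```python
# Minimal client sketch, assuming api_demo.py serves an OpenAI-compatible
# /v1/chat/completions endpoint on localhost:8000 (assumed defaults).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default host/port
    json={
        "model": "Qwen1.5-MoE-A2.7B-Chat",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello, please introduce yourself."}],
        "max_tokens": 512,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```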

Inference environment: Python=3.8.18, CUDA=12.2, a single A100 80G GPU

torch==2.1.2 transformers==4.41.0.dev0 peft==0.10.0 accelerate==0.28.0 gradio==3.48.0 trl==0.8.6 datasets==2.15.0

Expected behavior

Inference with Qwen1.5-MoE-A2.7B-Chat takes more than 4x as long as Qwen1.5-7B-Chat. How can this be optimized?
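A rough way to quantify the gap is to time plain transformers generation for both models under identical settings. The sketch below assumes the model paths from the command above plus a local Qwen1.5-7B-Chat checkpoint; paths and the prompt are placeholders:

```python
# Rough latency comparison sketch; model paths and prompt are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_generation(model_path: str, prompt: str, max_new_tokens: int = 256) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    del model
    torch.cuda.empty_cache()  # release GPU memory before loading the next model
    return elapsed

for path in ["/Qwen1.5-7B-Chat", "/Qwen1.5-MoE-A2.7B-Chat"]:
    print(path, f"{time_generation(path, 'Hello, please introduce yourself.'):.1f}s")
```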

System Info

CentOS 7.6

Others

No response

hiyouga commented 6 months ago

This is likely due to the limited efficiency of the official MoE implementation; there is currently no way to fix it on our side.
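For context, the Hugging Face sparse-MoE blocks (including Qwen2MoE's) route tokens by looping over every expert in Python and gathering the tokens assigned to each one, so each decoding step launches many small kernels and leaves the GPU underutilized. The following is a simplified sketch of that routing pattern, not the actual library code (it omits Qwen2MoE's shared expert and normalization details):

```python
# Simplified sketch of the per-expert routing loop used by Hugging Face
# sparse-MoE blocks (e.g. Qwen2MoE); not the actual library code.
import torch

def moe_forward(hidden, router, experts, top_k=4):
    # hidden: (num_tokens, dim); router: nn.Linear(dim, num_experts)
    logits = router(hidden)
    weights, selected = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    out = torch.zeros_like(hidden)
    # A Python loop over all experts: each iteration launches its own
    # kernels even when only a handful of tokens hit that expert, which
    # is the main source of the slowdown at batch size 1.
    for idx, expert in enumerate(experts):
        token_ids, slot = torch.where(selected == idx)
        if token_ids.numel() == 0:
            continue
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(hidden[token_ids])
    return out
```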