vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: There is a 10x performance gap between the --lora-modules deployment mode and the merged-model deployment mode #10664

Open LIUKAI0815 opened 1 day ago

LIUKAI0815 commented 1 day ago

Proposal to improve performance

vllm serve /workspace/model/llm/Qwen/Qwen2_5-3B-Instruct \
    --host 0.0.0.0 \
    --port 2017 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --lora-modules question_ext3B=/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200 \
    --enable-lora \
    --max-lora-rank 32

time: 23.94313097000122

vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged

time: 2.6456634998321533
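For context, a minimal client-side timing sketch of how such a comparison might be run against the two OpenAI-compatible servers. The measurement method is not stated in the report, so treat this as an assumption: the prompt and the merged server's port (8000 is vLLM's default) are placeholders; the model field must be the LoRA alias (question_ext3B) for the adapter deployment and the model path (or --served-model-name) for the merged one.

```python
# Hypothetical timing sketch; the actual measurement method is not given in the issue.
import time
import requests

def time_completion(base_url: str, model: str, prompt: str) -> float:
    start = time.perf_counter()
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": model, "prompt": prompt, "max_tokens": 256},
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

prompt = "..."  # placeholder prompt

# LoRA deployment: address the adapter by its --lora-modules alias.
print(time_completion("http://localhost:2017", "question_ext3B", prompt))

# Merged deployment: address the model by its path (or --served-model-name, if set).
print(time_completion(
    "http://localhost:8000",  # assumed default port; not specified in the issue
    "/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged",
    prompt,
))
```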

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

vllm 0.6.4.post1

Before submitting a new issue...

jeejeelee commented 1 day ago

Hi, please try to remove --enforce-eager and test again
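As an illustration of that A/B test, a minimal offline sketch using the vLLM Python API (a hypothetical bench_lora.py; the prompt and single-request setup are assumptions, not the reporter's benchmark):

```python
# Hypothetical A/B sketch: time LoRA generation with and without eager mode.
# Run each configuration in its own process so GPU memory is not held twice:
#   python bench_lora.py --enforce-eager
#   python bench_lora.py
import argparse
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

parser = argparse.ArgumentParser()
parser.add_argument("--enforce-eager", action="store_true")
args = parser.parse_args()

llm = LLM(
    model="/workspace/model/llm/Qwen/Qwen2_5-3B-Instruct",
    enable_lora=True,
    max_lora_rank=32,
    enforce_eager=args.enforce_eager,
)
lora = LoRARequest(
    "question_ext3B", 1,
    "/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200",
)

start = time.perf_counter()
llm.generate(
    ["..."],  # placeholder prompt
    SamplingParams(max_tokens=256),
    lora_request=lora,
)
print(f"enforce_eager={args.enforce_eager}: {time.perf_counter() - start:.2f}s")
```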

LIUKAI0815 commented 1 day ago

@jeejeelee The speed gap is not as big now, but there is still a gap, and the merged model is still faster.

jeejeelee commented 1 day ago

@jeejeelee The speed gap is not as big now, but there is still a gap, and the merged model is still faster.

Yes, compared to the merged model, LoRA incurs additional computation, which leads to a performance gap. Could you please provide the running details?
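To make the extra computation concrete: with an unmerged adapter, every LoRA-targeted linear layer computes the base projection plus two additional low-rank matmuls per token, while merging folds the adapter into the base weight once, offline. A minimal plain-PyTorch sketch (not vLLM's actual LoRA kernels; the shapes are illustrative except the rank, which follows --max-lora-rank 32):

```python
# Minimal sketch of why unmerged LoRA costs more per token (not vLLM's kernels).
import torch

torch.manual_seed(0)
d, r = 2048, 32                                 # hidden size (illustrative), LoRA rank
x = torch.randn(1, d, dtype=torch.float64)      # one token's activations
W = torch.randn(d, d, dtype=torch.float64)      # frozen base weight
A = torch.randn(r, d, dtype=torch.float64)      # LoRA down-projection
B = torch.randn(d, r, dtype=torch.float64)      # LoRA up-projection
scaling = 1.0                                   # alpha / r in real adapters

# Unmerged (what --lora-modules serves): base matmul + two extra LoRA matmuls per layer.
y_lora = x @ W.T + scaling * (x @ A.T) @ B.T

# Merged (the checkpoint-1200-merged path): fold the adapter into W once, offline.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T                       # single matmul per layer, same result

print(torch.allclose(y_lora, y_merged))         # True: same output, less work per token
```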

LIUKAI0815 commented 16 hours ago

export CUDA_VISIBLE_DEVICES=3
export VLLM_RPC_TIMEOUT=1800000
export VLLM_USE_MODELSCOPE=False

vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged \
    --host 0.0.0.0 \
    --port 2019 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name qwen2_5-3b-instruct
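For completeness, a sketch of how a merged checkpoint like checkpoint-1200-merged is typically produced from the adapter, assuming checkpoint-1200 is a standard PEFT LoRA checkpoint (the exact merge procedure used here is not stated in the issue):

```python
# Hypothetical merge step (PEFT), assuming checkpoint-1200 is a standard LoRA adapter.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "/workspace/model/llm/Qwen/Qwen2_5-3B-Instruct"
ADAPTER = "/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200"
OUT = ADAPTER + "-merged"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()  # fold B @ A into the base weights
merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)  # the merged directory also needs a tokenizer to serve
```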