LIUKAI0815 opened this issue 1 day ago
Hi, please try to remove --enforce-eager and test again
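For reference, a rough sketch of what the LoRA serving command shared later in this thread would look like without `--enforce-eager` (same paths, port, and flags as the original; only that flag is dropped so CUDA graphs can be captured):

```bash
# Same LoRA serving setup as in the report below, minus --enforce-eager.
vllm serve /workspace/model/llm/Qwen/Qwen2_5-3B-Instruct \
    --host 0.0.0.0 \
    --port 2017 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enable-lora \
    --max-lora-rank 32 \
    --lora-modules question_ext3B=/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200
```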
@jeejeelee The speed gap is not as big now, but there is still a gap, and the merged model is still faster.
Yes, compared to the merged model, LoRA incurs additional computation, which leads to a performance gap. Could you please provide the running details?
```bash
export CUDA_VISIBLE_DEVICES=3
export VLLM_RPC_TIMEOUT=1800000
export VLLM_USE_MODELSCOPE=False
vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged \
    --host 0.0.0.0 \
    --port 2019 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name qwen2_5-3b-instruct
```
Proposal to improve performance
```bash
vllm serve /workspace/model/llm/Qwen/Qwen2_5-3B-Instruct \
    --host 0.0.0.0 \
    --port 2017 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --lora-modules question_ext3B=/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200 \
    --enable-lora \
    --max-lora-rank 32
```
time: 23.94313097000122
```bash
vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged
```
time: 2.6456634998321533
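If it helps to compare the two setups under identical conditions, here is a minimal timing sketch. It assumes both servers above are running locally on ports 2017 and 2019 and uses the served model names from those commands; the prompt and max_tokens are arbitrary:

```bash
# Send one identical request to the LoRA server (2017) and the merged-model
# server (2019) and time each response end to end.
for PORT in 2017 2019; do
  if [ "$PORT" = "2017" ]; then MODEL=question_ext3B; else MODEL=qwen2_5-3b-instruct; fi
  time curl -s "http://localhost:${PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${MODEL}\", \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}], \"max_tokens\": 128}" \
    > /dev/null
done
```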
Your current environment (if you think it is necessary)
vllm 0.6.4.post1
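If fuller environment details are needed, vLLM's collect_env.py script (assumed to still live at the root of the vllm repository) gathers GPU, driver, and package versions:

```bash
# Download and run vLLM's environment collection script.
wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
python collect_env.py
```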