LIUKAI0815 opened this issue 1 day ago
Hi, please try to remove --enforce-eager and test again
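For reference, a rough sketch of what the LoRA serving command shared later in this thread would look like without `--enforce-eager` (same paths, port, and flags as the original; only that flag is dropped so CUDA graphs can be captured):

```bash
# Same LoRA serving setup as in the report below, minus --enforce-eager.
vllm serve /workspace/model/llm/Qwen/Qwen2_5-3B-Instruct \
    --host 0.0.0.0 \
    --port 2017 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enable-lora \
    --max-lora-rank 32 \
    --lora-modules question_ext3B=/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200
```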
@jeejeelee The speed gap is not as big now, but there is still a gap, and the merged model is still faster.
Yes, compared to the merged model, LoRA incurs additional computation, which leads to a performance gap. Could you please provide the running details?
```bash
export CUDA_VISIBLE_DEVICES=3
export VLLM_RPC_TIMEOUT=1800000
export VLLM_USE_MODELSCOPE=False
vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged \
    --host 0.0.0.0 \
    --port 2019 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name qwen2_5-3b-instruct
```
Proposal to improve performance
```bash
vllm serve /workspace/model/llm/Qwen/Qwen2_5-3B-Instruct \
    --host 0.0.0.0 \
    --port 2017 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --lora-modules question_ext3B=/workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200 \
    --enable-lora \
    --max-lora-rank 32
```
time: 23.94313097000122
```bash
vllm serve /workspace/output/question_extration/qwen/qwen2_5-3b-instruct/v0-20241122-142013/checkpoint-1200-merged
```
time: 2.6456634998321533
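If it helps to compare the two setups under identical conditions, here is a minimal timing sketch. It assumes both servers above are running locally on ports 2017 and 2019 and uses the served model names from those commands; the prompt and max_tokens are arbitrary:

```bash
# Send one identical request to the LoRA server (2017) and the merged-model
# server (2019) and time each response end to end.
for PORT in 2017 2019; do
  if [ "$PORT" = "2017" ]; then MODEL=question_ext3B; else MODEL=qwen2_5-3b-instruct; fi
  time curl -s "http://localhost:${PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${MODEL}\", \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}], \"max_tokens\": 128}" \
    > /dev/null
done
```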
Your current environment (if you think it is necessary)
vllm 0.6.4.post1
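If fuller environment details are needed, vLLM's collect_env.py script (assumed to still live at the root of the vllm repository) gathers GPU, driver, and package versions:

```bash
# Download and run vLLM's environment collection script.
wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
python collect_env.py
```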