Pavloveuge opened this issue 2 months ago
I haven't done precise testing, but I think your scenario is as expected.
Oh, thank you. But do you have any hypotheses about why this happens? What could be causing the slowdown? After all, these requests do not use adapters, and I still have enough free memory for the caches.
It could be that there is overhead from un-applying the LoRA adapter from the model before processing the request. I could be wrong. Maybe @robertgshaw2-neuralmagic or others can clarify.
> Oh, thank you. But do you have any hypotheses why this is so? What could be causing the slowdown? After all, these are requests that do not use adapters and I still have enough free memory for caches
Sorry for the delayed feedback. Even if none of your requests use adapters, the LoRA-related code still executes (for example, the LoRA kernels are still called), which can introduce additional overhead. PS: You can verify this using profiler-related functions.
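A toy sketch of the point above (this is not vLLM's actual code; the function names and the "-1 means no adapter" convention are illustrative assumptions): when LoRA is enabled, the batched LoRA path runs for every forward pass, even when every request in the batch maps to "no adapter", and that extra dispatch work is the per-request overhead being discussed.

```python
# Hypothetical sketch (not vLLM's actual code) of a batched LoRA dispatch
# path that executes even when no request in the batch uses an adapter.

def forward_base(x):
    # Stand-in for the base model's projection: double each element.
    return [2 * v for v in x]

def forward_with_lora_path(x, adapter_ids):
    out = forward_base(x)
    # The LoRA path walks the adapter-index list for the whole batch
    # regardless of whether any request actually uses an adapter; in a
    # real engine a batched LoRA kernel would be launched here.
    for i, aid in enumerate(adapter_ids):
        if aid >= 0:               # assume -1 means "no adapter"
            out[i] += 0            # placeholder for the B @ (A @ x) term
    return out

batch = [1.0, 2.0, 3.0]
no_adapters = [-1, -1, -1]
# Numerically identical to the pure base path, but the extra loop (and in
# practice the extra kernel launch) still happened.
assert forward_with_lora_path(batch, no_adapters) == forward_base(batch)
```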
> You can verify this using profiler-related functions.
Are you talking about vLLM's built-in debug/profiling settings, or an arbitrary external profiler?
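Either approach can work. As a minimal, self-contained illustration of the methodology (using Python's standard-library cProfile on stub functions; the stubs are assumptions, not vLLM internals), the point is that a profile report will show the LoRA-path functions consuming time even when no adapter is active. vLLM itself also has torch-profiler-based hooks (for example the VLLM_TORCH_PROFILER_DIR environment variable in recent versions), which would give kernel-level detail.

```python
import cProfile
import io
import pstats

def base_forward(x):
    return [v * 2 for v in x]

def lora_kernel_stub(x):
    # Placeholder for the LoRA shrink/expand kernels that still run.
    return [0.0 for _ in x]

def step(x):
    y = base_forward(x)
    z = lora_kernel_stub(x)  # overhead even for non-LoRA requests
    return [a + b for a, b in zip(y, z)]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    step([1.0] * 64)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()

# The stub shows up in the profile even though no adapter is active.
assert "lora_kernel_stub" in report
```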
Your current environment
I'm using
vllm/vllm-openai:v0.6.0
How would you like to use vllm
I already use vLLM for inference of some models and everything works fine; I also have load tests for my usage scenario. Recently I wanted to add some LoRA models. After running my load tests (which send requests to the base model, not to the LoRA) on an instance with LoRA enabled, I noticed that latency increased by about 5-10% versus the instance without LoRA.
My base model is openchat3.6 (finetune of llama2), and the LoRA has r=16 on the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] layers.
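A back-of-envelope sketch of what r=16 adds in raw arithmetic (the 4096 hidden size below is an assumed Llama-like value, not taken from the issue): for one linear layer, LoRA computes B @ (A @ x) with A of shape (r, d_in) and B of shape (d_out, r), so the extra math per token is tiny relative to the base projection. That suggests the measured 5-10% latency hit comes more from kernel-launch and bookkeeping overhead than from the added FLOPs themselves.

```python
# Extra multiply-accumulates (MACs) LoRA adds per token for one linear
# layer, versus the base projection. Sizes are illustrative assumptions.

def base_macs(d_in, d_out):
    # Base projection: W @ x with W of shape (d_out, d_in).
    return d_in * d_out

def lora_extra_macs(d_in, d_out, r):
    # LoRA: B @ (A @ x), A is (r, d_in), B is (d_out, r).
    return r * d_in + d_out * r

d = 4096  # assumed hidden size, Llama-like
r = 16    # rank from the issue

base = base_macs(d, d)
extra = lora_extra_macs(d, d, r)

# The added math is well under 1% of the base projection per layer.
assert extra / base < 0.01
```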
I run vLLM (for only the base model) with:
and for LoRA:
I understand that using LoRA consumes additional GPU memory, which can reduce the memory available for the KV cache, but my
GPU KV cache usage:
is far from 100%. I found an issue which was fixed, but from the PR with the fix I didn't understand: is it now expected that non-LoRA requests to a vLLM instance with LoRA enabled will slow down?
Is it normal that I am facing a slowdown in this scenario?