
[Performance]: Clarification on Base Model Inference Count with Multiple LoRA Models in vLLM Deployment #8228

Open · zhangyuqi-1 opened this issue 2 months ago

zhangyuqi-1 commented 2 months ago

Misc discussion on performance

Question:

When deploying LoRA with vLLM, suppose I have 1000 different LoRA models, and each LoRA receives a separate request with a different input. In this scenario, how many times does the base model actually perform inference? Is it only once, or does it perform 1000 inferences?

I understand that the LoRA part will run 1000 times, but its computational cost is relatively small. I'm mainly concerned with how many times the base model runs inference in this case. If the base model runs only once, that would be remarkably efficient: the shared base-model cost would stay essentially constant as the number of LoRA models grows. Is this possible?
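For concreteness, the kind of deployment described above can be sketched with vLLM's offline LoRA API roughly as follows. The model name, adapter paths, and limits are placeholders, and exact arguments may differ across vLLM versions:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One shared base model; LoRA adapters are loaded on demand.
# max_loras caps how many distinct adapters can be active in one batch.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=8)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names its own adapter (paths here are placeholders).
# When such requests arrive concurrently (e.g. through the
# OpenAI-compatible server), vLLM schedules them into the same batch.
for i in range(3):
    out = llm.generate(
        f"Prompt for adapter {i}",
        params,
        lora_request=LoRARequest(f"adapter_{i}", i + 1, f"/path/to/lora_{i}"),
    )
    print(out[0].outputs[0].text)
```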


jeejeelee commented 2 months ago

When deploying LoRA with vLLM, suppose I have 1000 different LoRA models, and each LoRA receives a separate request with a different input. In this scenario, how many times does the base model actually perform inference? Is it only once, or does it perform 1000 inferences?

Only once. If you want to delve deeper, see: https://github.com/vllm-project/vllm/pull/1804

zhangyuqi-1 commented 2 months ago

That’s fascinating! Does vLLM batch requests for different LoRA models into a single batch, so that the base model only performs inference once? I’m curious how this is achieved.

jeejeelee commented 2 months ago

That’s fascinating! Does vLLM batch requests for different LoRA models into a single batch, so that the base model only performs inference once? I’m curious how this is achieved.

See the Punica paper or its blog post.
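To give a rough picture of the idea from the Punica paper, here is a minimal PyTorch sketch of the batched-gather ("BGMV") computation. All shapes and weights are made up for illustration, and the real implementation is a fused CUDA kernel rather than an einsum:

```python
import torch

batch, d_in, d_out, rank, n_adapters = 6, 16, 16, 4, 3

x = torch.randn(batch, d_in)             # mixed batch: rows belong to 3 different adapters
W = torch.randn(d_in, d_out)             # shared base weight, used once for the whole batch
A = torch.randn(n_adapters, d_in, rank)  # per-adapter LoRA "A" matrices
B = torch.randn(n_adapters, rank, d_out) # per-adapter LoRA "B" matrices
idx = torch.tensor([0, 0, 1, 2, 2, 1])   # which adapter each row uses

# Base model: one GEMM over the whole mixed batch, regardless of adapter count.
y = x @ W

# LoRA correction: each row gathers its own adapter's A/B factors and adds the
# low-rank delta. Punica's BGMV/SGMV kernels fuse this gather + matmul on GPU.
y = y + torch.einsum("bi,bir,bro->bo", x, A[idx], B[idx])
```

The shared weight W is multiplied once no matter how many distinct adapters appear in idx; only the small per-adapter A/B factors differ across rows, which is why adding more LoRA models barely changes the base model's cost.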

zhangyuqi-1 commented 2 months ago

That’s fascinating! Does vLLM batch requests for different LoRA models into a single batch, so that the base model only performs inference once? I’m curious how this is achieved.

See the Punica paper or its blog post.

Thanks!