vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Performance issue when loading lora modules #3219

Open · sleepwalker2017 opened this issue 6 months ago

sleepwalker2017 commented 6 months ago

I compared two ways to launch the server.

The model is vicuna-7b, and the GPUs are 2 × A30.

The 1st way is:

python -m vllm.entrypoints.openai.api_server \
            --model /data/models/vicuna-7b-v1.5/ \
            --tensor-parallel-size 2  --gpu-memory-utilization 0.9 --enforce-eager --disable-log-requests

The 2nd way is:

python -m vllm.entrypoints.openai.api_server \
            --model /data/models/vicuna-7b-v1.5/ \
            --max-loras 16 --tensor-parallel-size 2  --max-lora-rank 64 --gpu-memory-utilization 0.9 \
            --enable-lora --enforce-eager --disable-log-requests --lora-modules lora1=/root/path1/  lora2=/root/path2/ ...

In both tests, I send the same request, which sets the model as /data/models/vicuna-7b-v1.5/.
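
For context, the request in both tests looks roughly like this (an illustrative sketch only: the prompt and sampling parameters are placeholders, and the server is assumed to be on vLLM's default port 8000):

import requests

# Base-model request: the "model" field is the base model path, so no LoRA
# adapter is involved. A LoRA request would instead set "model" to a name
# registered via --lora-modules (e.g. "lora1").
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/data/models/vicuna-7b-v1.5/",
        "prompt": "San Francisco is a",  # placeholder prompt
        "max_tokens": 128,               # placeholder sampling parameter
    },
)
print(resp.json())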

But the performance differs a lot.

[Screenshot: throughput comparison between the two launch configurations]

simon-mo commented 6 months ago

This is in line with what we expect. There are two optimizations that can be applied (help wanted!):

  1. Only enable the Punica kernels when a LoRA is actually requested.

     From a discussion with @Yard1 offline:

     Currently we do all of the LoRA work even if there are no LoRAs loaded. I think it can be further optimized (orthogonally to getting better kernels).

     In other words, if anyone can help us enable the Punica kernels for multi-LoRA only when a LoRA is used, that would be wonderful (see the sketch below).

  2. Optimize the Punica kernels for smaller, non-A100/H100 devices, even down to sm75.
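
A minimal sketch of the first idea, assuming a hypothetical model-runner hook. The names maybe_apply_lora, batch.lora_requests, set_active_adapters, and apply_punica_kernels are placeholders for illustration, not vLLM's actual internals:

# Hypothetical sketch: skip all LoRA work when a batch contains no LoRA requests.
def maybe_apply_lora(batch, lora_manager):
    # Fast path: no request in this batch asked for a LoRA adapter, so the
    # Punica kernels (and LoRA weight mapping) can be skipped entirely.
    if not batch.lora_requests:
        return

    # Slow path: set up the active adapters and run the Punica kernels.
    lora_manager.set_active_adapters(batch.lora_requests)
    lora_manager.apply_punica_kernels(batch)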

sleepwalker2017 commented 6 months ago

OK, I'm studying the multi-LoRA code in vLLM. I'll look into this, and if I'm able to, I'll help by sending a PR.

sleepwalker2017 commented 6 months ago

Hello, still some points to confirm.

  1. With LoRA disabled, the throughput is 311.5 tokens/sec.
  2. With LoRA enabled but only base-model requests, I get 254.9 tokens/sec.
  3. With LoRA enabled and LoRA requests, I get 196 tokens/sec.

Once the optimization is done, case 2 should match case 1, but case 3 will stay about where it is. Is that right?

It seems the overhead of multi-LoRA is fairly large.
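
For reference, the relative slowdowns implied by the numbers above (plain arithmetic on the reported throughputs, nothing measured anew):

# Throughputs reported above, in tokens/sec.
base = 311.5          # LoRA disabled
lora_enabled = 254.9  # LoRA enabled, base-model requests
lora_used = 196.0     # LoRA enabled, LoRA requests

# Relative slowdown versus the LoRA-disabled baseline.
print(f"LoRA enabled, base requests: {(1 - lora_enabled / base):.1%} slower")  # ~18.2%
print(f"LoRA enabled, LoRA requests: {(1 - lora_used / base):.1%} slower")     # ~37.1%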