vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How do I configure Phi-3-vision for high throughput? #7751

Open hommayushi3 opened 3 weeks ago

hommayushi3 commented 3 weeks ago

How would you like to use vllm

I want to run Phi-3-vision with vLLM to serve parallel calls at high throughput. In my setup (an OpenAI-compatible vLLM 0.5.4 server on Hugging Face Inference Endpoints with an NVIDIA L4 24GB GPU), I have configured Phi-3-vision with the following parameters:

DISABLE_SLIDING_WINDOW=true
DTYPE=bfloat16
ENFORCE_EAGER=true   # Tried both true/false
GPU_MEMORY_UTILIZATION=0.98  # Tried 0.6-0.99
MAX_MODEL_LEN=3072  # Smallest token length that supports my work
MAX_NUM_BATCHED_TOKENS=12288  # Tried 3072-12288
MAX_NUM_SEQS=16  # Tried 2-32
QUANTIZATION=fp8  # Tried fp8 and None
TRUST_REMOTE_CODE=true
VLLM_ATTENTION_BACKEND=FLASH_ATTN
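For reference, these environment variables map onto vLLM engine arguments; a rough offline-engine equivalent looks like the sketch below (assuming vLLM 0.5.x and the microsoft/Phi-3-vision-128k-instruct checkpoint; this is not the exact Inference Endpoints launcher):

import os

# VLLM_ATTENTION_BACKEND is picked up from the environment at engine start.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

# Offline-engine equivalent of the server configuration above (sketch only).
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed checkpoint
    dtype="bfloat16",
    enforce_eager=True,
    gpu_memory_utilization=0.98,
    max_model_len=3072,
    max_num_batched_tokens=12288,
    max_num_seqs=16,
    quantization="fp8",
    trust_remote_code=True,
    disable_sliding_window=True,
)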

I am running into the issue that, no matter what settings I use, total inference time increases linearly with the number of concurrent calls; batched requests get no parallel speedup. For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.

The logs show:

Avg prompt throughput: 3461 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 68.3%, CPU KV cache usage: 0.0%
Avg prompt throughput: 0 tokens/s, Avg generation throughput: 154.3 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 40.8%, CPU KV cache usage: 0.0%
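For reference, here is a minimal sketch of how such a 1-vs-4 concurrency comparison could be reproduced against the OpenAI-compatible endpoint (the endpoint URL, API key, model name, and prompt are placeholders, and it assumes the openai>=1.0 Python client):

import asyncio
import time

from openai import AsyncOpenAI  # assumes openai>=1.0

# Placeholder endpoint and model name; point these at your own deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "microsoft/Phi-3-vision-128k-instruct"

async def one_request() -> None:
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Describe a cat in one sentence."}],
        max_tokens=64,
    )

async def timed(n: int) -> float:
    # Fire n requests concurrently and measure wall-clock time for the batch.
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return time.perf_counter() - start

async def main() -> None:
    print(f"1 request : {await timed(1):.1f} s")
    print(f"4 requests: {await timed(4):.1f} s")

asyncio.run(main())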

Questions:

  1. Is this a configuration/usage issue? What other parameters might I be missing?
  2. Is this an issue with Phi-3-vision? (might be related to this issue)
  3. Would this be fixed with Phi-3.5-vision?

Dineshkumar-Anandan-ZS0367 commented 3 weeks ago

Have you deployed any vision-language model across two machines, e.g. with pipeline parallelism? Can you suggest some ideas?

Thanks in advance for any suggestions. How do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?

DarkLight1337 commented 3 weeks ago

Have you deployed any vision-language model across two machines, e.g. with pipeline parallelism?

PP is not yet supported for vision language models (#7684). Also, the model has not been fully TP'ed yet (#7186). Performance should improve once these PRs are completed.

DarkLight1337 commented 3 weeks ago

Thanks in advance for any suggestions. How do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?

vLLM's OpenAI-compatible server supports image input via the Chat Completions API. Please refer to OpenAI's docs for more details.
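For example, a request with both an image and a prompt could look like the sketch below (assuming the openai>=1.0 Python client; the endpoint URL, model name, and image URL are placeholders):

from openai import OpenAI  # assumes openai>=1.0

# Placeholder endpoint; vLLM's OpenAI-compatible server typically ignores the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # The image can also be passed as a base64 data URL.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)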

hommayushi3 commented 3 weeks ago

I don't think either of these is relevant to my issue. I am using a single NVIDIA L4, not a multi-GPU setup.

DarkLight1337 commented 3 weeks ago

I suggest profiling the code to see where the bottleneck is. It's possible that most of the execution time is taken up by the model forward pass, in which case adjusting the batching params can hardly yield any improvement.
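For example, one way to check this is to wrap a single generation in PyTorch's profiler (a sketch assuming offline vLLM 0.5.x usage; the model name and prompt are placeholders):

import torch
from vllm import LLM, SamplingParams

# Placeholder model and prompt; the goal is to see how much of the time
# goes into the model forward pass versus everything else.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
    max_model_len=3072,
)
params = SamplingParams(max_tokens=64)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
) as prof:
    llm.generate(["Describe a cat in one sentence."], params)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))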

DarkLight1337 commented 3 weeks ago

@youkaichao @ywang96 perhaps you have a better idea about this?

youkaichao commented 3 weeks ago

It definitely needs profiling first.

ywang96 commented 1 week ago

For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.

@hommayushi3 Can you share more details on how you currently set up the workload?

Without this information, we can't really help you optimize for it.