hommayushi3 opened 3 weeks ago
Have you deployed any vision language model across two machines, e.g. with pipeline parallelism? Can you suggest some ideas?
Thanks in advance for any suggestions. Also, how do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?
> Have you deployed any vision language model across two machines, e.g. with pipeline parallelism? Can you suggest some ideas?
PP is not yet supported for vision language models (#7684). Also, the model hasn't been fully TP'ed yet (#7186). The performance should improve after these PRs are completed.
> Thanks in advance for any suggestions. Also, how do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?
vLLM's server supports image input via the OpenAI Chat Completions API. Please refer to OpenAI's docs for more details.
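For illustration, here is a minimal sketch of such a request using the `openai` Python client, assuming a vLLM OpenAI-compatible server at http://localhost:8000 serving Phi-3-vision (the base URL and model name are assumptions, not confirmed settings):

```python
# Minimal sketch: send an image + text prompt to a vLLM OpenAI-compatible
# server via the Chat Completions API. The base_url and model name are
# assumptions; adjust them to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM server endpoint
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed served model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```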
I don't think either of these is relevant to my issue. I am using a single NVIDIA L4, not a multi-GPU setup.
I suggest profiling the code to see where the bottleneck is. It's possible that most of the execution time is taken up by the model's forward pass, in which case there can hardly be any improvement from adjusting the batching parameters.
@youkaichao @ywang96 perhaps you have a better idea of this?
Definitely, it needs profiling first.
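As a quick first check before deeper profiling, a small client-side harness can show whether total latency really grows linearly with concurrency. A minimal sketch, assuming the same OpenAI-compatible endpoint and model as above (both assumptions):

```python
# Rough concurrency check: time 1 request vs N concurrent requests against
# the vLLM server. If total time grows roughly linearly with N, requests
# are not being batched. Endpoint and model name are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> None:
    await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",  # assumed model
        messages=[{"role": "user", "content": "Describe batching in one line."}],
        max_tokens=64,
    )

async def timed(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return time.perf_counter() - start

async def main() -> None:
    for n in (1, 2, 4):
        print(f"{n} concurrent request(s): {await timed(n):.2f}s")

asyncio.run(main())
```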
> For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.
@hommayushi3 Can you share the information on how you currently set up the workload, including how you initialize the `LLM` class? Without this information, we can't really help you optimize your workload.
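For reference, the kind of setup detail that helps here looks like the following sketch of an `LLM` initialization; all values are illustrative assumptions, not the actual configuration in question:

```python
# Example of the setup details worth sharing: how the engine is constructed
# and which limits apply. All values below are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,      # Phi-3-vision requires remote code
    max_model_len=4096,          # assumed context limit for a 24GB GPU
    gpu_memory_utilization=0.9,  # assumed fraction of VRAM for the engine
)
sampling = SamplingParams(temperature=0.0, max_tokens=128)
```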
How would you like to use vllm
I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible vLLM 0.5.4 server on HuggingFace Inference Endpoints with an NVIDIA L4 24GB GPU), I have set up Phi-3-vision with the following parameters:
I am running into the issue that, no matter what settings I use, adding more concurrent calls increases the total inference time linearly; batching is not providing any parallelism. For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.
The logs show:
Questions: