Cyb4Black opened 4 days ago
Yes, this is a limit I haven't resolved yet. The only short-term solution I have is to block and process only one request at a time, which is probably better done client-side so you don't hit request timeouts. Is this something that you need?
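For reference, client-side serialization could look something like the sketch below (the `openai` Python client, the port, and the model name are assumptions, adjust them to your deployment):

```python
import threading

from openai import OpenAI  # assumes the openai python package (>= 1.0)

# Placeholder endpoint and model name; change both to match your deployment.
client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

_request_lock = threading.Lock()  # shared by every caller in this process

def chat(messages, model="your-vision-model"):
    # Only one request is ever in flight, so the server never sees
    # two concurrent completions.
    with _request_lock:
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content
```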
Dynamic batching is a much more complex solution with vision models; they don't have any consistent way to batch. Sometimes image contexts can be batched, sometimes they can't, and most of the time only the chat can be batched, not the image context. This is inconsistent with the expectations of the API, so I have not implemented it at all.
The only practical workaround I can suggest is to load multiple copies of the server on different ports, perhaps with a load balancer in front. This is not a good general solution because vision models are typically huge, and running several copies would require an enormous amount of VRAM, so this is not implemented either.
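If you do go that route, even a tiny client-side round-robin can stand in for a full load balancer. A minimal sketch, assuming two copies of the server on ports 5006 and 5007 (both ports and the model name are placeholders):

```python
import itertools
import threading

from openai import OpenAI  # assumes the openai python package (>= 1.0)

# Placeholder ports for two copies of the server; adjust to your setup.
BACKENDS = ["http://localhost:5006/v1", "http://localhost:5007/v1"]
_clients = [OpenAI(base_url=url, api_key="skip") for url in BACKENDS]
_next_client = itertools.cycle(_clients)
_pick_lock = threading.Lock()

def chat(messages, model="your-vision-model"):
    # Only the backend selection is locked; the requests themselves
    # run in parallel against different server copies.
    with _pick_lock:
        client = next(_next_client)
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```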
Actually, for our hackathon we needed to process multiple requests in parallel, and we were lucky that TGI by Hugging Face had just recently added support for MLlama, so we don't use openedai-vision for now.
Just wanted to make sure you are aware of the bug.
No problem. You may also be interested to know that vLLM supports a few of the good vision models and is great for multiple concurrent requests.
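For example, assuming a vLLM OpenAI-compatible server is already running on localhost:8000 with a vision model loaded (the port and model name below are placeholders), concurrent requests can simply be fired with asyncio and the server handles the batching:

```python
import asyncio

from openai import AsyncOpenAI  # assumes the openai python package (>= 1.0)

# Placeholder endpoint; adjust to wherever your vLLM server is listening.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def ask(prompt, model="your-vision-model"):  # placeholder model name
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def main():
    # Several requests in flight at once; vLLM batches them server-side.
    answers = await asyncio.gather(*(ask(f"Question {i}") for i in range(4)))
    for answer in answers:
        print(answer)

asyncio.run(main())
```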
See title. If you try to have it handle two requests in parallel with streaming enabled, the response starts but dies halfway through the answer.
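A minimal sketch of the scenario (the `openai` Python client, port, and model name are assumptions, adjust to your deployment): two streaming chat completions are started at the same time, and the reported failure is that one of the streams dies partway through.

```python
import threading

from openai import OpenAI  # assumes the openai python package (>= 1.0)

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

def stream_one(tag):
    stream = client.chat.completions.create(
        model="your-vision-model",  # placeholder model name
        messages=[{"role": "user", "content": "Describe a busy street scene in detail."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(f"[{tag}] {chunk.choices[0].delta.content}", end="", flush=True)

threads = [threading.Thread(target=stream_one, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```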