turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Issues with Concurrent Request Handling using exllamav2 and Flask Streaming #393

Closed: iammrj closed this issue 1 month ago

iammrj commented 3 months ago

Description

I am experiencing issues when trying to handle multiple concurrent requests using exllamav2 for GPTQ quantised model inference in a Flask server environment. Specifically, when one request is being processed and another comes in, I see problems such as incomplete responses and answers being mixed up between different questions. I am utilising the streaming capabilities of exllamav2 over Server-Sent Events (SSE) in Python Flask to achieve real-time inference feedback.

Environment

RAM: 24GB
GPU: NVIDIA (nvidia-cuda-runtime-cu12: 12.1.105)
Model: 7B GPTQ quantized model
Framework/Library: exllamav2, Flask
Deployment: Kubernetes

Expected Behavior

When multiple requests are sent to the Flask server, each request should be processed independently, and the streaming response should be specific to each request without interference.

Current Behavior

When a request is being processed and another request is sent concurrently, the responses sometimes are incomplete or mixed up between the two different requests.

Steps to Reproduce

  1. Start the Flask server with exllamav2 integration for LLM inference.
  2. Send a request to the server and, while it's processing, immediately send another request.
  3. Observe that the responses for both requests are either incomplete or incorrect (mixed up).

Seeking Advice on Implementing a Queue for Handling Concurrent Requests

In addition to resolving the concurrency issue described above, I'm considering the possibility of introducing a queuing mechanism for managing incoming requests. The goal is to serialize the requests, ensuring each one is fully processed before moving on to the next, while still leveraging the real-time streaming capabilities of SSE with exllamav2 in Flask.

Key Considerations:

- Queue Implementation: What would be the recommended way to implement such a queue in a Kubernetes-deployed Flask application? Are there preferred Python libraries or Kubernetes-native solutions that integrate well with Flask and exllamav2 for this purpose?
- Maintaining Streaming Responses: How can we ensure that the real-time nature of SSE is maintained, allowing clients to receive streaming responses while their requests are queued and processed in turn?
- Scalability and Performance: How might introducing a queue affect the overall scalability and performance of the application, especially considering the resource-intensive nature of LLM inference on NVIDIA GPUs? Are there patterns or best practices for balancing queuing delay against computational load?

I am open to any suggestions, including changes to the application architecture, deployment strategies, or even alternative approaches to handling concurrent requests more effectively. Insights on how to effectively queue requests in a way that complements our current streaming setup would be greatly appreciated.

Additional Context

I'm aiming for a robust solution that allows multiple concurrent requests to be handled seamlessly without compromising the integrity of the responses. Any insights or suggestions on how to better architect this setup or adjust configurations to prevent such issues would be greatly appreciated.

turboderp commented 2 months ago

The best answer is probably paged attention. This was added to flash-attn, and I intend to also add support at some point. It would require a new generator pipeline and a new method for managing the cache but it would allow you to dynamically scale the batch size, and rather than allocating space for a fixed number of generations at a fixed max length for each, you would allocate a total token count which could be distributed as needed between concurrent generations. Lots of complications down that path, obviously, so I'm not sure when it will happen exactly.

As for what you're seeing, the implementation isn't thread-safe. It has a number of buffers for inference at a given chunk length and batch size, and it doesn't dynamically allocate new ones if you call model.forward in two separate threads. You would just get undefined behavior as the two threads overwrite each other's intermediate values.

Likewise, the generators are stateful. Each maintains one (batched) completion and manages one cache for that process. For the streaming generator, for instance, if you call begin_stream_ex in the middle of another completion, that ongoing completion is corrupted as the state is overwritten with the second prompt.

The model itself is stateless, so you could have multiple generators working in turn on the same model. There's also an example in the repo of how to use multiple caches to gain some of the benefits of batching without having to run at a constant batch size.
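For illustration, here is a minimal sketch of the "one model, several generators" idea: the weights are loaded once and shared, while each generator owns its own cache and state and is used in turn. Class and call names follow the exllamav2 examples as I understand them (ExLlamaV2StreamingGenerator, begin_stream_ex/stream_ex); treat the exact signatures as assumptions and check them against the repo's examples.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # hypothetical model directory
config.prepare()

model = ExLlamaV2(config)
model.load()                          # weights loaded once, shared by all generators
tokenizer = ExLlamaV2Tokenizer(config)

# Two independent caches -> two independent generator states over the same weights
cache_a = ExLlamaV2Cache(model)
cache_b = ExLlamaV2Cache(model)
gen_a = ExLlamaV2StreamingGenerator(model, cache_a, tokenizer)
gen_b = ExLlamaV2StreamingGenerator(model, cache_b, tokenizer)

settings = ExLlamaV2Sampler.Settings()

def run_to_completion(generator, prompt, max_new_tokens=256):
    """Run one prompt on one generator from start to finish before reusing it."""
    input_ids = tokenizer.encode(prompt)
    generator.begin_stream_ex(input_ids, settings)
    out = ""
    for _ in range(max_new_tokens):
        res = generator.stream_ex()
        out += res["chunk"]
        if res["eos"]:
            break
    return out

# The generators must still be used sequentially (or confined to workers that never
# interleave forward passes), since model.forward itself is not thread-safe.
print(run_to_completion(gen_a, "Question 1: ..."))
print(run_to_completion(gen_b, "Question 2: ..."))
```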

If all you want to do is queue requests, that shouldn't be much of an issue. Just make sure the previous generation is finished before you reuse the same resources (generator+cache) for the next one. If Flask receives requests asynchronously, stash all incoming requests in a FIFO queue and have the main thread serve them one after another.
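A minimal sketch of that FIFO approach under Flask, assuming a single shared generator+cache: one worker thread pulls jobs off a queue and runs each generation to completion, and each request streams its own tokens back over SSE through a per-request queue. The generate_stream() function here is a hypothetical stand-in for whatever exllamav2 streaming loop you already have.

```python
import queue
import threading
from flask import Flask, Response, request

app = Flask(__name__)
job_queue = queue.Queue()          # FIFO of pending requests

def worker():
    # Only this thread ever touches the model/generator, so there is no
    # concurrent access to the shared buffers or generator state.
    while True:
        prompt, out_q = job_queue.get()
        try:
            for chunk in generate_stream(prompt):   # your existing exllamav2 streaming loop
                out_q.put(chunk)
        finally:
            out_q.put(None)                         # sentinel: generation finished
            job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    out_q = queue.Queue()
    job_queue.put((prompt, out_q))      # queued behind any in-flight generations

    def sse():
        while True:
            chunk = out_q.get()
            if chunk is None:
                break
            yield f"data: {chunk}\n\n"

    return Response(sse(), mimetype="text/event-stream")
```

Clients connected via SSE simply wait while their request sits in the queue, then start receiving tokens as soon as the worker reaches it, so the streaming behavior is preserved at the cost of queuing latency.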

Overall though, efficiently managing concurrent requests to an engine that demands 100% of the available VRAM and compute resources is non-trivial. You probably don't want to limit the context length per user to accommodate more users, for instance, but you also can't dynamically allocate resources as required since that's inefficient and prone to memory fragmentation. It's really paged attention you want in this case.

Personally though, I feel like people usually get the wrong idea about ExLlama. It can do a lot with a limited amount of VRAM, but it does this primarily by quantizing weights. And as you scale a server deployment to more and more users, the weights make up less and less of the overall amount of VRAM you'll need. You'll be more limited by other factors at some point, like efficient K/V cache management and dynamic batching, and you're going to need a big stack of GPUs regardless.

turboderp commented 2 months ago

I should also add that there are existing projects you can look at with different takes on how to do this. Take a look at tabbyAPI and EricLLM.

turboderp commented 1 month ago

The new generator implements dynamic batching using paged attention (requires Flash Attention 2.5.7+). It pretty much addresses all of this, and you can check it out in the dev branch for now. I expect to bump the version in the next couple of days.
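For reference, a rough sketch of what the dynamic generator enables, based on the dev-branch API as I understand it (ExLlamaV2DynamicGenerator / ExLlamaV2DynamicJob); treat the class names, arguments, and result keys as assumptions and check the examples in the repo. Several jobs share one paged cache with a total token budget and are batched dynamically, each streaming its own output.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"    # hypothetical model directory
config.prepare()

model = ExLlamaV2(config)
# Total token budget shared by all concurrent jobs (paged attention, needs flash-attn 2.5.7+)
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Enqueue several prompts; the generator interleaves them in one dynamic batch
for prompt in ["Question 1: ...", "Question 2: ..."]:
    generator.enqueue(ExLlamaV2DynamicJob(
        input_ids=tokenizer.encode(prompt),
        max_new_tokens=200,
    ))

outputs = {}
while generator.num_remaining_jobs():
    for result in generator.iterate():               # one forward step across all active jobs
        job = result["job"]
        outputs.setdefault(job, "")
        outputs[job] += result.get("text", "")       # streamed text chunk, if any this step

for job, text in outputs.items():
    print(text)
```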