Closed: rjmehta1993 closed this issue 5 months ago.
Note: The model is not running in async mode. Does it have to be async for jobs to be added, removed, and executed automatically whenever the cache has room for them? Or can I run the model as a class object and put the API behind a synchronous Flask wrapper?
@turboderp Thanks for your help and suggestions on this one. I have tried to wrap my head around sync/async with an LLM and paged attention from just about every angle, and I only created this issue when I couldn't find any resources covering an LLM behind a synchronous Flask server. Please let me know if this is not the right direction.
I would guess the problem here is that you end up with two threads calling iterate() concurrently. The generator isn't thread-safe, and honestly it's a little surprising that it doesn't just crash when used like that.
Regardless, the trick would be to use tasks rather than threads. The async wrapper facilitates this nicely by letting each job act as an independent generator and routing the batched streams automatically. I'm not sure whether Flask has a single-threaded/async mode, though. Perhaps it would be easier to use something like Quart? Though I'm not an expert on that either.
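Roughly, with Quart and the async wrapper, each request can own its job as a task. A minimal sketch, assuming the exllamav2 async generator classes; the model path, max_seq_len, and route name are placeholders, and the exact loading calls may differ between exllamav2 versions:

```python
from quart import Quart, request, jsonify

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGeneratorAsync, ExLlamaV2DynamicJobAsync

app = Quart(__name__)
generator = None
tokenizer = None

@app.before_serving
async def load_model():
    global generator, tokenizer
    config = ExLlamaV2Config("/path/to/model")                  # placeholder model dir
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)  # placeholder seq len
    model.load_autosplit(cache)
    tokenizer = ExLlamaV2Tokenizer(config)
    # The async wrapper owns the single iterate() loop and routes each job's
    # stream back to the task that created it.
    generator = ExLlamaV2DynamicGeneratorAsync(
        model=model, cache=cache, tokenizer=tokenizer
    )

@app.route("/generate", methods=["POST"])
async def generate():
    data = await request.get_json()
    job = ExLlamaV2DynamicJobAsync(
        generator,
        input_ids=tokenizer.encode(data["prompt"], add_bos=True),
        max_new_tokens=512,
    )
    text = ""
    async for result in job:   # this task only ever sees its own job's results
        text += result.get("text", "")
    return jsonify({"output": text})

if __name__ == "__main__":
    app.run(port=5000)         # single process; all requests share one generator
```

Concurrent requests then batch inside the one generator instead of competing for it from separate threads.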
This is how I use the dynamic generator as a class object with the server wrapped in Flask. But the responses get mixed up when I send two requests simultaneously (mimicking a client).
Note: The prompts are not predefined in a list object; there is a queue where each job is added and then removed when it finishes.
LOAD MODEL
LLM CLASS OBJ & FLASK WRAPPER
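(The collapsed snippets aren't reproduced here, so the following is only a rough approximation of the pattern described above, not the original code. Class names, paths, and parameters are placeholders, and the exllamav2 calls follow the sync dynamic-generator API as I understand it.)

```python
import uuid
from flask import Flask, request, jsonify

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

class LLM:
    def __init__(self, model_dir):
        config = ExLlamaV2Config(model_dir)
        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, max_seq_len=8192, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2DynamicGenerator(
            model=self.model, cache=self.cache, tokenizer=self.tokenizer
        )

    def generate(self, prompt):
        ident = str(uuid.uuid4())
        self.generator.enqueue(ExLlamaV2DynamicJob(
            input_ids=self.tokenizer.encode(prompt, add_bos=True),
            max_new_tokens=512,
            identifier=ident,
        ))
        text, done = "", False
        while not done:
            # Problem: every Flask worker thread ends up in this loop, so two
            # simultaneous requests both call iterate() on the same generator.
            for result in self.generator.iterate():
                if result.get("identifier") != ident:
                    continue                      # filtering by identifier is not enough
                text += result.get("text", "")
                done = done or result.get("eos", False)
        return text

llm = LLM("/path/to/model")                       # placeholder path
app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    return jsonify({"output": llm.generate(request.get_json()["prompt"])})
```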
CLIENT REQUEST
OUTPUT
If you look at the logs in the terminal after pprint-ing the results, tokens are leaking between jobs as each request executes. Can the idx passed as "identifier" maintain isolation between jobs?
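For what it's worth, as far as I understand the dynamic generator, the identifier is only echoed back with each result, so it can keep streams separate only if a single consumer calls iterate(). A sketch of that pattern, reusing the hypothetical llm object from the sketch above; the worker/queue plumbing here is made up, not part of exllamav2:

```python
import queue, threading, time, uuid

from exllamav2.generator import ExLlamaV2DynamicJob

result_queues = {}         # identifier -> queue.Queue for that request
pending = queue.Queue()    # (identifier, prompt) pairs submitted by Flask threads

def worker():
    # The only thread that touches the generator: it enqueues jobs and calls iterate().
    while True:
        while not pending.empty():
            ident, prompt = pending.get()
            llm.generator.enqueue(ExLlamaV2DynamicJob(
                input_ids=llm.tokenizer.encode(prompt, add_bos=True),
                max_new_tokens=512,
                identifier=ident,
            ))
        results = llm.generator.iterate()
        if not results:
            time.sleep(0.005)                     # idle; avoid spinning
        for result in results:
            q = result_queues.get(result.get("identifier"))
            if q is not None:
                q.put(result)                     # route by identifier

threading.Thread(target=worker, daemon=True).start()

def generate(prompt):
    # Called from any Flask request thread; never touches the generator directly.
    ident = str(uuid.uuid4())
    result_queues[ident] = q = queue.Queue()
    pending.put((ident, prompt))
    text = ""
    while True:
        result = q.get()
        text += result.get("text", "")
        if result.get("eos"):
            break
    del result_queues[ident]
    return text
```

With a single iterate() consumer the identifier is enough to keep the streams apart; with the async wrapper, that routing is handled for you.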