turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Question about example_flask.py #235

Open · ZeroYuJie opened this issue 11 months ago

ZeroYuJie commented 11 months ago

I found the example of using Flask to serve API requests and gave it a try, but when I make concurrent requests, the generated responses come back as garbled text. I suspect this is because inference is running for two questions at the same time. Is it possible to run answer generation concurrently?

turboderp commented 11 months ago

There's no support for concurrency, no. You'd need a separate instance for each thread, with its own generator and cache, and some mechanism for sensibly splitting the work between threads, given that the implementation completely occupies the GPU.
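(Not from the thread, but for illustration: a minimal sketch of the "one generator, serialized access" workaround, assuming a single generator has been built at startup roughly as in example_flask.py. The route name, `max_new_tokens` value, and the `generate_simple()` call are assumptions here, not taken verbatim from the example.)

```python
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
generate_lock = threading.Lock()

# `generator` is assumed to be a single generator instance (with its own model
# and cache) created once at startup, as in example_flask.py.
def generate_locked(prompt: str, max_new_tokens: int = 200) -> str:
    # Only one thread may use the generator/cache at a time, so concurrent
    # requests are queued behind the lock instead of corrupting each other.
    with generate_lock:
        return generator.generate_simple(prompt, max_new_tokens=max_new_tokens)

@app.route("/infer", methods=["POST"])
def infer():
    prompt = request.json["prompt"]
    return jsonify({"response": generate_locked(prompt)})
```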

You could possibly have a streaming API that dispatches to multiple generators when there are concurrent requests, but you'd need a lot of VRAM to accommodate that.
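(Again as a rough, hypothetical sketch of the multi-generator idea: each concurrent request checks a generator out of a pool and returns it when done. This assumes enough VRAM for several independent model/cache/generator stacks; `build_generators()` and the pool size are made up for illustration.)

```python
import queue
from flask import Flask, request, jsonify

app = Flask(__name__)

# Pool of independent generators, each with its own cache, built at startup.
generator_pool = queue.Queue()
# for g in build_generators(n=2):   # hypothetical setup helper
#     generator_pool.put(g)

@app.route("/infer", methods=["POST"])
def infer():
    prompt = request.json["prompt"]
    gen = generator_pool.get()            # block until a generator is free
    try:
        output = gen.generate_simple(prompt, max_new_tokens=200)
    finally:
        generator_pool.put(gen)           # always return it to the pool
    return jsonify({"response": output})
```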