Open ZeroYuJie opened 11 months ago
I found an example of using Flask to serve API requests. I gave it a try, but when I make concurrent requests, the generated responses come back as garbled text. I suspect this is because two questions are being run through inference at the same time. Is it possible to generate answers concurrently?

There's no support for concurrency, no. You'd need a separate instance for each thread, each with its own generator and cache, plus some mechanism for sensibly splitting the work between threads, given that the implementation completely occupies the GPU.
You could possibly have a streaming API that dispatches to multiple generators when there are concurrent requests, but you'd need a lot of VRAM to accommodate that.
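One stopgap that avoids the garbled output without extra VRAM is to serialize requests: guard the single generator with a lock so Flask's request threads take turns instead of interleaving their token streams. The sketch below uses a dummy generator class as a stand-in for the real model (the names `DummyGenerator`, `SerializedGenerator`, and `generate` are illustrative, not part of any real API); requests queue up behind the lock, so latency grows under load but each response stays intact.

```python
import threading

class DummyGenerator:
    """Stand-in for the real model generator; assume one instance
    can only serve one request at a time (it occupies the GPU)."""
    def generate(self, prompt: str) -> str:
        # Token-by-token loop stands in for GPU inference.
        return " ".join(tok.upper() for tok in prompt.split())

class SerializedGenerator:
    """Wraps a single generator behind a lock so concurrent callers
    (e.g. Flask request threads) run one at a time instead of
    corrupting each other's generation state."""
    def __init__(self, gen):
        self._gen = gen
        self._lock = threading.Lock()

    def generate(self, prompt: str) -> str:
        with self._lock:
            return self._gen.generate(prompt)

# Simulate four concurrent requests hitting the shared generator.
shared = SerializedGenerator(DummyGenerator())
results = {}

def handle_request(req_id: str, prompt: str) -> None:
    results[req_id] = shared.generate(prompt)

threads = [
    threading.Thread(target=handle_request, args=(f"req{i}", f"hello world {i}"))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["req0"])  # → HELLO WORLD 0
```

True concurrency would still require one generator (and its cache) per worker, as noted above; the lock only makes a single instance safe to share.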