go-skynet / go-llama.cpp

LLama.cpp golang bindings

inference from the same model for multiple concurrent users #221

Closed: alph4b3th closed this issue 11 months ago

alph4b3th commented 11 months ago

In a production environment we receive numerous simultaneous requests and want to respond to them as quickly as possible. However, it seems that a model instance can only serve one request at a time (until it finishes generating or is told to stop), and loading multiple instances into memory is unfeasible and slow; the operating system scheduler would go crazy! I haven't read the complete source code, but I suspect that many threads are instantiated inside only one (or a few) processes. If that's the case, the model at least saves resources compared to multiprocessing, but we still face difficulties handling multiple concurrent requests. One solution is a software design pattern like Worker or Producer-Consumer: create a fixed number of workers (meaning we'd have x threads handling many simultaneous requests between them).
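As far as I can tell, the only safe pattern with the current bindings is to serialize every call on one shared instance. Here is a minimal sketch of that baseline, assuming go-llama.cpp's `New`/`Predict` API (the model path and option values are placeholders):

```go
package main

import (
	"fmt"
	"sync"

	llama "github.com/go-skynet/go-llama.cpp"
)

var (
	mu    sync.Mutex   // serializes access: one Predict at a time
	model *llama.LLama // single shared instance, loaded once at startup
)

// handle answers one user request; concurrent callers queue on the mutex,
// so throughput is limited to one generation at a time. This is exactly
// the bottleneck the worker design below tries to widen.
func handle(prompt string) (string, error) {
	mu.Lock()
	defer mu.Unlock()
	return model.Predict(prompt)
}

func main() {
	var err error
	// "model.bin" and the context size are placeholders.
	model, err = llama.New("model.bin", llama.SetContext(512))
	if err != nil {
		panic(err)
	}

	out, err := handle("Hello!")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```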

Worker Design (a Go sketch follows the list):

  1. The Manager receives 300 simultaneous user requests and assigns them to Workers.
  2. Each Worker processes one token, stores some conversation state in Redis (I'm still not sure how exactly to extract that state from llama.cpp), and then yields to the next request (a context switch).
  3. While some Workers accept new requests after processing a token, their sibling Workers keep processing pending requests and return to step 2.
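
To make the idea concrete, here is a rough worker-pool sketch in Go against the same assumed `New`/`Predict` API. One simplification to flag: `Predict` blocks until generation finishes, and I don't know of a supported way to snapshot llama.cpp's conversation state from these bindings, so this version context-switches per request rather than per token. The pool size, model path, and option values are placeholders.

```go
package main

import (
	"fmt"
	"sync"

	llama "github.com/go-skynet/go-llama.cpp"
)

// job carries a prompt and a channel on which the completion is returned.
type job struct {
	prompt string
	result chan string
}

// worker owns one model instance; instances must not be shared, since a
// single instance cannot serve two requests at once.
func worker(m *llama.LLama, jobs <-chan job, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		out, err := m.Predict(j.prompt, llama.SetTokens(128))
		if err != nil {
			out = "error: " + err.Error()
		}
		j.result <- out
	}
}

func main() {
	const numWorkers = 2 // fixed pool: memory cost = numWorkers instances

	jobs := make(chan job, 300) // the Manager's queue of pending requests
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		m, err := llama.New("model.bin", llama.SetContext(512))
		if err != nil {
			panic(err)
		}
		wg.Add(1)
		go worker(m, jobs, &wg)
	}

	// Simulate a burst of concurrent user requests.
	results := make([]chan string, 5)
	for i := range results {
		results[i] = make(chan string, 1)
		jobs <- job{prompt: fmt.Sprintf("request %d", i), result: results[i]}
	}
	for i, r := range results {
		fmt.Printf("user %d -> %s\n", i, <-r)
	}

	close(jobs)
	wg.Wait()
}
```

Per-token scheduling as described in step 2 would additionally need a way to save and restore per-conversation state (the Redis part), which is exactly the open question below.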

This software model describes how I envision handling multiple requests at the same time without creating an instance for each request. However, I'm still unsure how to implement parts of it, above all how to collect (and later restore) conversation state from llama.cpp.

Here's an additional question:

"llama.133123.log" is filling up my disk, and I haven't figured out how to turn off this annoying thing! In production, I usually use a more sophisticated way to manage logs in a scalable manner.