go-skynet / go-llama.cpp

LLama.cpp golang bindings

inference from the same model for multiple concurrent users #221

Closed: alph4b3th closed this issue 11 months ago

alph4b3th commented 11 months ago

In a production environment we receive numerous simultaneous requests and want to respond to them as quickly as possible. However, it seems that a model instance can only serve one request at a time (until it finishes generating or is told to stop), and loading multiple instances into memory is unfeasible and slow; the operating system scheduler would go crazy! I haven't read the complete source code, but I suspect that many threads are instantiated inside only one (or a few) processes. If that's the case, the model at least saves resources compared to multiprocessing, but we still face difficulties handling multiple concurrent requests. One solution is a software design pattern like Worker or Producer-Consumer: create a fixed number of workers (meaning we'd have x threads handling many simultaneous requests between them).
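As far as I can tell, the only safe pattern with the current bindings is to serialize every call on one shared instance. Here is a minimal sketch of that baseline, assuming go-llama.cpp's `New`/`Predict` API (the model path and option values are placeholders):

```go
package main

import (
	"fmt"
	"sync"

	llama "github.com/go-skynet/go-llama.cpp"
)

var (
	mu    sync.Mutex   // serializes access: one Predict at a time
	model *llama.LLama // single shared instance, loaded once at startup
)

// handle answers one user request; concurrent callers queue on the mutex,
// so throughput is limited to one generation at a time. This is exactly
// the bottleneck the worker design below tries to widen.
func handle(prompt string) (string, error) {
	mu.Lock()
	defer mu.Unlock()
	return model.Predict(prompt)
}

func main() {
	var err error
	// "model.bin" and the context size are placeholders.
	model, err = llama.New("model.bin", llama.SetContext(512))
	if err != nil {
		panic(err)
	}

	out, err := handle("Hello!")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```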

Worker Design (a Go sketch follows the list):

  1. The Manager receives 300 simultaneous user requests and assigns them to Workers.
  2. Each Worker processes one token, stores some conversation state in Redis (I'm still not sure how exactly to extract that state from llama.cpp), and then yields to the next request (a context switch).
  3. While some Workers accept new requests after processing a token, their sibling Workers keep processing pending requests and return to step 2.
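
To make the idea concrete, here is a rough worker-pool sketch in Go against the same assumed `New`/`Predict` API. One simplification to flag: `Predict` blocks until generation finishes, and I don't know of a supported way to snapshot llama.cpp's conversation state from these bindings, so this version context-switches per request rather than per token. The pool size, model path, and option values are placeholders.

```go
package main

import (
	"fmt"
	"sync"

	llama "github.com/go-skynet/go-llama.cpp"
)

// job carries a prompt and a channel on which the completion is returned.
type job struct {
	prompt string
	result chan string
}

// worker owns one model instance; instances must not be shared, since a
// single instance cannot serve two requests at once.
func worker(m *llama.LLama, jobs <-chan job, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		out, err := m.Predict(j.prompt, llama.SetTokens(128))
		if err != nil {
			out = "error: " + err.Error()
		}
		j.result <- out
	}
}

func main() {
	const numWorkers = 2 // fixed pool: memory cost = numWorkers instances

	jobs := make(chan job, 300) // the Manager's queue of pending requests
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		m, err := llama.New("model.bin", llama.SetContext(512))
		if err != nil {
			panic(err)
		}
		wg.Add(1)
		go worker(m, jobs, &wg)
	}

	// Simulate a burst of concurrent user requests.
	results := make([]chan string, 5)
	for i := range results {
		results[i] = make(chan string, 1)
		jobs <- job{prompt: fmt.Sprintf("request %d", i), result: results[i]}
	}
	for i, r := range results {
		fmt.Printf("user %d -> %s\n", i, <-r)
	}

	close(jobs)
	wg.Wait()
}
```

Per-token scheduling as described in step 2 would additionally need a way to save and restore per-conversation state (the Redis part), which is exactly the open question below.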

This software model describes how I envision handling multiple requests at the same time without creating an instance for each request. However, I'm still unsure how to implement parts of it, above all how to collect (and later restore) conversation state from llama.cpp.

Here's an additional question:

"llama.133123.log" is filling up my disk, and I haven't figured out how to turn off this annoying thing! In production, I usually use a more sophisticated way to manage logs in a scalable manner.