mostlygeek / llama-swap

HTTP proxy for on-demand model loading with llama.cpp (or other OpenAI compatible backends)
MIT License

Support routing to multiple backends #7

Open mostlygeek opened 3 weeks ago

mostlygeek commented 3 weeks ago

For multi-GPU machines, it should be possible to load multiple inference backends and route requests to them appropriately. For example:

                               +----------+              
         +-------------------> | qwen-72b |              
         |                     +----------+              
         |                        running on GPU #1,#2,#3
+--------+---+                                           
| llama-swap |                                           
+--------+---+                                           
         |                                               
         |                     +---------------+         
         +-------------------> | qwen-coder-7b |         
                               +---------------+         
                                  running on GPU #4      

The use case is finer control over, and better utilization of, local resources. This would also work well for software development, where a larger model can be used for chat and a smaller, faster model for auto-complete.

With NVIDIA GPUs (on Linux), this can be done using CUDA_VISIBLE_DEVICES in the environment.
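
For illustration, a rough sketch of how the two backends in the diagram could be pinned to separate GPUs on the command line (ports, model paths, and the zero-based CUDA device indices are placeholders, not part of the proposal):

# qwen-72b on the first three GPUs (CUDA devices 0,1,2)
CUDA_VISIBLE_DEVICES=0,1,2 llama-server --port 9001 -m models/qwen-72b.gguf

# qwen-coder-7b on the fourth GPU (CUDA device 3)
CUDA_VISIBLE_DEVICES=3 llama-server --port 9002 -m models/qwen-coder-7b.gguf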

mostlygeek commented 5 days ago

To avoid introducing a lot more complexity, it is desirable to keep using the API paths /v1/chat/completions and /v1/completions. So the current design is to encode the group into the model name as group/model, e.g. coding/qwen-coder-32b.

The configuration file could look like this:

# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"

When called with coding/qwen-coder-1b, llama-swap will make sure the whole coding group is loaded, so a subsequent call to coding/qwen-coder-32b will be routed to an already running server. However, when called with just qwen-coder-1b, it will unload the whole group and load only that single model.
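
For example, a request for a grouped model would look like an ordinary OpenAI-style call (a sketch only; it assumes llama-swap is listening on http://localhost:8080, which is not specified above):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coding/qwen-coder-1b",
    "messages": [{"role": "user", "content": "write a quicksort in Go"}]
  }'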

There are some tradeoffs between configuration and complexity with this approach: