mostlygeek opened 3 weeks ago
To avoid introducing a lot more complexity, it is desirable to keep using the API paths `/v1/chat/completions` and `/v1/completions`. The current design is to encode the group into the model name, like `group/model` or `coding/qwen-coder-32b`.
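From the client's side, nothing changes except the model name. A minimal sketch of a request under this scheme (the `localhost:8080` listen address and the prompt are assumptions for illustration, not part of the design):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding/qwen-coder-32b",
        "messages": [{"role": "user", "content": "Explain this regex: ^[a-z]+$"}]
      }'
```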
The configuration file could look like this:
```yaml
# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"
```
When calling with `coding/qwen-coder-1b`, llama-swap will make sure the whole `coding` group is loaded, so calls to `coding/qwen-coder-32b` will be routed to an already running server. However, when calling with just `qwen-coder-1b`, it will unload the whole group and load only that single model.
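To make the swap behaviour concrete, here is a sketch of a request sequence against the config above (the `:8080` listen address, prompts, and `max_tokens` values are illustrative assumptions):

```sh
# Starts the whole "coding" group: both backends on ports 9001 and 9002.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-1b", "prompt": "def fib(n):", "max_tokens": 32}'

# Same group, different member: no swap, routed to the server already running on 9002.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-32b", "prompt": "def fib(n):", "max_tokens": 32}'

# No group prefix: the "coding" group is unloaded and only qwen-coder-1b is started.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder-1b", "prompt": "def fib(n):", "max_tokens": 32}'
```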
There are some trade-offs between configuration and complexity with this approach:

- a `model` definition would need to be created
For multi-GPU machines, be able to load multiple inference backends and route to them appropriately; for example, a larger model for chat and a smaller, faster model for auto-complete. The use case is to have more control and better utilization of local resources, and it fits software development workflows particularly well. With NVIDIA GPUs (on Linux), assigning each backend to a specific GPU can be done using `CUDA_VISIBLE_DEVICES` in the environment.
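A sketch of how that could combine with the grouping proposal above. The model names, the `dev` group, the file paths, and the assumption that the `cmd` string accepts a `VAR=value` environment prefix (i.e. is run through a shell or handled equivalently) are all illustrative, not confirmed behaviour:

```yaml
models:
  # Larger chat model pinned to GPU 0
  "chat-32b":
    cmd: CUDA_VISIBLE_DEVICES=0 llama-server --port 9001 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9001

  # Smaller, faster auto-complete model pinned to GPU 1
  "autocomplete-1b":
    cmd: CUDA_VISIBLE_DEVICES=1 llama-server --port 9002 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9002

groups:
  dev:
    - "chat-32b"
    - "autocomplete-1b"
```

Requesting `dev/chat-32b` and `dev/autocomplete-1b` would then keep both backends resident, one per GPU, instead of swapping between them.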