mostlygeek opened 3 weeks ago
To avoid introducing a lot more complexity, it is desirable to keep using the API paths `/v1/chat/completions` and `/v1/completions`. The current design is to encode the group into the model name, like `group/model` or `coding/qwen-coder-32b`.
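From the client's side, nothing changes except the model name. A minimal sketch of a request under this scheme (the `localhost:8080` listen address and the prompt are assumptions for illustration, not part of the design):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding/qwen-coder-32b",
        "messages": [{"role": "user", "content": "Explain this regex: ^[a-z]+$"}]
      }'
```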
The configuration file could look like this:
```yaml
# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"
```
When calling with `coding/qwen-coder-1b`, llama-swap will make sure the whole `coding` group is loaded, so calls to `coding/qwen-coder-32b` will be routed to an already running server. However, when calling with just `qwen-coder-1b`, it will unload the whole group and load only that single model.
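To make the swap behaviour concrete, here is a sketch of a request sequence against the config above (the `:8080` listen address, prompts, and `max_tokens` values are illustrative assumptions):

```sh
# Starts the whole "coding" group: both backends on ports 9001 and 9002.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-1b", "prompt": "def fib(n):", "max_tokens": 32}'

# Same group, different member: no swap, routed to the server already running on 9002.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-32b", "prompt": "def fib(n):", "max_tokens": 32}'

# No group prefix: the "coding" group is unloaded and only qwen-coder-1b is started.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder-1b", "prompt": "def fib(n):", "max_tokens": 32}'
```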
There are some trade-offs between configuration and complexity with this approach:

- a `model` definition would need to be created
For multi-GPU machines, be able to load multiple inference backends and route to them appropriately; for example, a larger model for chat and a smaller, faster model for auto-complete. The use case is to have more control and better utilization of local resources, and it fits software development workflows particularly well. With NVIDIA GPUs (on Linux), assigning each backend to a specific GPU can be done using `CUDA_VISIBLE_DEVICES` in the environment.
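A sketch of how that could combine with the grouping proposal above. The model names, the `dev` group, the file paths, and the assumption that the `cmd` string accepts a `VAR=value` environment prefix (i.e. is run through a shell or handled equivalently) are all illustrative, not confirmed behaviour:

```yaml
models:
  # Larger chat model pinned to GPU 0
  "chat-32b":
    cmd: CUDA_VISIBLE_DEVICES=0 llama-server --port 9001 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9001

  # Smaller, faster auto-complete model pinned to GPU 1
  "autocomplete-1b":
    cmd: CUDA_VISIBLE_DEVICES=1 llama-server --port 9002 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9002

groups:
  dev:
    - "chat-32b"
    - "autocomplete-1b"
```

Requesting `dev/chat-32b` and `dev/autocomplete-1b` would then keep both backends resident, one per GPU, instead of swapping between them.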