gpustack / llama-box

LLM inference server implementation based on llama.cpp.

LLaMA Box

LLaMA Box is a clean, pure-API (no frontend assets) LLM inference server, offered as an alternative to llama-server.

Agenda

Features

Supports

Download LLaMA Box from the latest release page. LLaMA Box currently supports the following platforms:

| Backend | OS/Arch | Device Requirement |
| --- | --- | --- |
| NVIDIA CUDA 12.4 | linux/amd64, windows/amd64 | Compute capability matches 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9 or 9.0, see https://developer.nvidia.com/cuda-gpus. Driver version requires >=525.60.13, see https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id4. |
| AMD ROCm/HIP 6.1 | linux/amd64, windows/amd64 | LLVM target matches gfx906 (linux only), gfx908 (linux only), gfx90a (linux only), gfx942 (linux only), gfx1030, gfx1100, gfx1101 or gfx1102, see https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.2/reference/system-requirements.html, https://rocm.docs.amd.com/projects/install-on-windows/en/docs-6.1.2/reference/system-requirements.html. |
| Intel oneAPI 2025.0 | linux/amd64, windows/amd64 | Support Intel oneAPI, see https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html. |
| Huawei Ascend CANN 8.0 | linux/amd64, linux/arm64 | Ascend 910b, see https://www.hiascend.com/document/detail/en/CANNCommunityEdition/600alphaX/softwareinstall/instg/atlasdeploy_03_0015.html. |
| Moore Threads MUSA rc3.1 | linux/amd64 | MTT S4000, MTT S80, see https://en.mthreads.com. |
| Apple Metal 3 | darwin/amd64, darwin/arm64 | Support Apple Metal, see https://support.apple.com/en-sg/102894. |
| AVX2 | darwin/amd64, linux/amd64, windows/amd64 | CPUs support AVX2, see https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2. |
| Advanced SIMD (NEON) | linux/arm64, windows/arm64 | CPUs support Advanced SIMD (NEON), see https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon). |
| AVX512 | linux/amd64, windows/amd64 | CPUs support AVX512, see https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#AVX-512. |
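
To choose between the CPU-only releases listed above, one option is to look at the CPU flags the Linux kernel reports. The following is a minimal sketch and not part of LLaMA Box: `suggest_cpu_backend` and `read_cpu_flags` are hypothetical helper names, the script assumes a Linux host with /proc/cpuinfo, and it treats the `avx512f` flag as a sufficient sign of AVX512 support, which the release notes do not confirm.

```shell
# Suggest a CPU-only release from a space-separated list of CPU flags.
# Hypothetical helper; the flag names are the ones Linux prints in /proc/cpuinfo.
suggest_cpu_backend() {
    flags=" $1 "
    case "$flags" in
        *" avx512f "*) echo "AVX512" ;;                # assumption: avx512f is enough
        *" avx2 "*)    echo "AVX2" ;;
        *" asimd "*)   echo "Advanced SIMD (NEON)" ;;  # arm64 kernels report "asimd"
        *)             echo "none" ;;
    esac
}

# Print the flag list of the first CPU; prints nothing when /proc/cpuinfo is missing.
# x86 kernels label the line "flags", arm64 kernels label it "Features".
read_cpu_flags() {
    if [ -r /proc/cpuinfo ]; then
        grep -m1 -E '^(flags|Features)' /proc/cpuinfo | cut -d: -f2
    fi
}

echo "Suggested CPU release: $(suggest_cpu_backend "$(read_cpu_flags)")"
```

The sketch prefers the widest instruction set it finds. That is a design assumption; the AVX2 release should also run on an AVX512-capable CPU.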

[!NOTE]

Since v0.0.60, the Linux releases are built as follows:

  • "NVIDIA CUDA 12.4" and "AMD ROCm/HIP 6.1" releases are built on CentOS 7 (glibc 2.17).
  • "Intel oneAPI 2025.0" releases are built on Ubuntu 22.04 (glibc 2.34).
  • "Huawei Ascend CANN 8.0" releases are built on Ubuntu 20.04 (glibc 2.31) and OpenEuler 20.03 (glibc 2.28).
  • "Moore Threads MUSA rc3.1" releases are built on Ubuntu 22.04 (glibc 2.34).
  • "AVX2" releases are built on CentOS 7 (glibc 2.17).
  • "Advanced SIMD (NEON)" releases are built on Ubuntu 18.04 (glibc 2.27).
  • "AVX512" releases are built on RockyLinux 8.9 (glibc 2.28).
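
The build environments matter because a binary linked against a given glibc generally needs that version or newer on the host. The sketch below checks this before downloading. It assumes GNU `sort -V` and `getconf GNU_LIBC_VERSION` are available; the helper names and the short release keys (`cuda`, `avx2`, and so on) are hypothetical and only mirror the list above.

```shell
# version_ge A B: succeed when dotted version A is >= B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Print the host glibc version, e.g. "2.35"; prints nothing on non-glibc systems.
host_glibc() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}'
}

# Minimum glibc per Linux release, copied from the note above.
# The keys are made-up shorthands, not official release names.
required_glibc() {
    case "$1" in
        cuda|rocm|avx2) echo "2.17" ;;
        oneapi|musa)    echo "2.34" ;;
        cann)           echo "2.28" ;;  # OpenEuler 20.03 build; the Ubuntu 20.04 build needs 2.31
        neon)           echo "2.27" ;;
        avx512)         echo "2.28" ;;
        *)              return 1 ;;
    esac
}

host="$(host_glibc)"
if [ -n "$host" ] && version_ge "$host" "$(required_glibc avx2)"; then
    echo "glibc $host is new enough for the AVX2 release"
fi
```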

Examples

Note: LM Studio provides a fantastic UI for downloading the GGUF model from Hugging Face. The GGUF model files used in the following examples are downloaded via LM Studio.

Usage

usage: llama-box [options]

general:

  -h,    --help, --usage          print usage and exit
         --version                print version and exit
         --system-info            print system info and exit
         --list-devices           print list of available devices and exit
  -v,    --verbose, --log-verbose 
                                  set verbosity level to infinity (i.e. log all messages, useful for debugging)
  -lv,   --verbosity, --log-verbosity V
                                  set the verbosity threshold, messages with a higher verbosity will be ignored
         --log-colors             enable colored logging

server:

         --host HOST              ip address to listen (default: 127.0.0.1)
         --port PORT              port to listen (default: 8080)
  -to,   --timeout N              server read/write timeout in seconds (default: 600)
         --threads-http N         number of threads used to process HTTP requests (default: -1)
         --conn-idle N            server connection idle in seconds (default: 60)
         --conn-keepalive N       server connection keep-alive in seconds (default: 15)
  -m,    --model FILE             model path (default: models/7B/ggml-model-f16.gguf)
  -a,    --alias NAME             model name alias (default: unknown)
         --lora FILE              apply LoRA adapter (implies --no-mmap)
         --lora-scaled FILE SCALE 
                                  apply LoRA adapter with user defined scaling S (implies --no-mmap)
         --lora-init-without-apply
                                  load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled)
  -s,    --seed N                 RNG seed (default: -1, use random seed for -1)
  -mg,   --main-gpu N             the GPU to use for the model (default: 0)
         --metrics                enable prometheus compatible metrics endpoint (default: disabled)
         --infill                 enable infill endpoint (default: disabled)
         --embeddings             enable embedding endpoint (default: disabled)
         --images                 enable image endpoint (default: disabled)
         --rerank                 enable reranking endpoint (default: disabled)
         --slots                  enable slots monitoring endpoint (default: disabled)
         --rpc SERVERS            comma separated list of RPC servers

server/completion:

  -dev,  --device <dev1,dev2,...> 
                                  comma-separated list of devices to use for offloading (none = don't offload)
                                  use --list-devices to see a list of available devices
  -ngl,  --gpu-layers,  --n-gpu-layers N
                                  number of layers to store in VRAM
  -sm,   --split-mode SPLIT_MODE  how to split the model across multiple GPUs, one of:
                                    - none: use one GPU only
                                    - layer (default): split layers and KV across GPUs
                                    - row: split rows across GPUs, store intermediate results and KV in --main-gpu
  -ts,   --tensor-split SPLIT     fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
         --override-kv KEY=TYPE:VALUE
                                  advanced option to override model metadata by key. may be specified multiple times.
                                  types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
         --chat-template JINJA_TEMPLATE
                                  set custom jinja chat template (default: template taken from model's metadata)
                                  only commonly used templates are accepted: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
         --chat-template-file FILE
                                  set a file to load a custom jinja chat template (default: template taken from model's metadata)
         --slot-save-path PATH    path to save slot kv cache (default: disabled)
  -sps,  --slot-prompt-similarity N
                                  how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)

  -tps,  --tokens-per-second N    maximum number of tokens per second (default: 0, 0 = disabled, -1 = try to detect)
                                  when enabled, limit the request within its X-Request-Tokens-Per-Second HTTP header
  -t,    --threads N              number of threads to use during generation (default: -1)
  -C,    --cpu-mask M             set CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: "")
  -Cr,   --cpu-range lo-hi        range of CPUs for affinity. Complements --cpu-mask
         --cpu-strict <0|1>       use strict CPU placement (default: 0)

         --prio N                 set process/thread priority (default: 0), one of:
                                    - 0-normal
                                    - 1-medium
                                    - 2-high
                                    - 3-realtime
         --poll <0...100>         use polling level to wait for work (0 - no polling, default: 50)

  -tb,   --threads-batch N        number of threads to use during batch and prompt processing (default: same as --threads)
  -Cb,   --cpu-mask-batch M       set CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask)
  -Crb,  --cpu-range-batch lo-hi  ranges of CPUs for affinity. Complements --cpu-mask-batch
         --cpu-strict-batch <0|1> 
                                  use strict CPU placement (default: same as --cpu-strict)
         --prio-batch N           set process/thread priority: 0-normal, 1-medium, 2-high, 3-realtime (default: same as --prio)
         --poll-batch <0...100>   use polling to wait for work (default: same as --poll)
  -c,    --ctx-size N             size of the prompt context (default: 4096, 0 = loaded from model)
         --no-context-shift       disables context shift on infinite text generation (default: disabled)
  -n,    --predict N              number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -b,    --batch-size N           logical maximum batch size (default: 2048)
  -ub,   --ubatch-size N          physical maximum batch size (default: 512)
         --keep N                 number of tokens to keep from the initial prompt (default: 0, -1 = all)
  -fa,   --flash-attn             enable Flash Attention (default: disabled)
  -e,    --escape                 process escape sequences (\n, \r, \t, \', \", \\) (default: true)
         --no-escape              do not process escape sequences
         --samplers SAMPLERS      samplers that will be used for generation in the order, separated by ';' (default: dry;top_k;typ_p;top_p;min_p;xtc;temperature)
         --sampling-seq SEQUENCE  simplified sequence for samplers that will be used (default: dkypmxt)
         --penalize-nl            penalize newline tokens (default: false)
         --temp T                 temperature (default: 0.8)
         --top-k N                top-k sampling (default: 40, 0 = disabled)
         --top-p P                top-p sampling (default: 0.9, 1.0 = disabled)
         --min-p P                min-p sampling (default: 0.1, 0.0 = disabled)
         --xtc-probability N      xtc probability (default: 0.0, 0.0 = disabled)
         --xtc-threshold N        xtc threshold (default: 0.1, 1.0 = disabled)
         --typical P              locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
         --repeat-last-n N        last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
         --repeat-penalty N       penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
         --presence-penalty N     repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
         --frequency-penalty N    repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
         --dry-multiplier N       set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
         --dry-base N             set DRY sampling base value (default: 1.75)
         --dry-allowed-length N   set allowed length for DRY sampling (default: 2)
         --dry-penalty-last-n N   set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
         --dry-sequence-breaker N 
                                  add sequence breaker for DRY sampling, clearing out default breakers (\n;:;";*)
                                  in the process; use "none" to not use any sequence breakers
         --dynatemp-range N       dynamic temperature range (default: 0.0, 0.0 = disabled)
         --dynatemp-exp N         dynamic temperature exponent (default: 1.0)
         --mirostat N             use Mirostat sampling, Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
         --mirostat-lr N          Mirostat learning rate, parameter eta (default: 0.1)
         --mirostat-ent N         Mirostat target entropy, parameter tau (default: 5.0)
  -l,    --logit-bias TOKEN_ID(+/-)BIAS
                                  modifies the likelihood of token appearing in the completion, i.e. "--logit-bias 15043+1" to increase likelihood of token ' Hello', or "--logit-bias 15043-1" to decrease likelihood of token ' Hello'
         --grammar GRAMMAR        BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '')
         --grammar-file FILE      file to read grammar from
  -j,    --json-schema SCHEMA     JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object. For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
         --rope-scaling {none,linear,yarn}
                                  RoPE frequency scaling method, defaults to linear unless specified by the model
         --rope-scale N           RoPE context scaling factor, expands context by a factor of N
         --rope-freq-base N       RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
         --rope-freq-scale N      RoPE frequency scaling factor, expands context by a factor of 1/N
         --yarn-orig-ctx N        YaRN: original context size of model (default: 0 = model training context size)
         --yarn-ext-factor N      YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation)
         --yarn-attn-factor N     YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
         --yarn-beta-fast N       YaRN: low correction dim or beta (default: 32.0)
         --yarn-beta-slow N       YaRN: high correction dim or alpha (default: 1.0)
  -nkvo, --no-kv-offload          disable KV offload
         --cache-prompt           enable caching prompt (default: enabled)
         --cache-reuse N          min chunk size to attempt reusing from the cache via KV shifting, implies --cache-prompt if set (default: 0)
  -ctk,  --cache-type-k TYPE      KV cache data type for K (default: f16)
  -ctv,  --cache-type-v TYPE      KV cache data type for V (default: f16)
  -dt,   --defrag-thold N         KV cache defragmentation threshold (default: 0.1, < 0 - disabled)
  -np,   --parallel N             number of parallel sequences to decode (default: 1)
  -cb,   --cont-batching          enable continuous batching (a.k.a dynamic batching) (default: enabled)
  -nocb, --no-cont-batching       disable continuous batching
         --mmproj FILE            path to a multimodal projector file for LLaVA
         --mlock                  force system to keep model in RAM rather than swapping or compressing
         --no-mmap                do not memory-map model (slower load but may reduce pageouts if not using mlock)
         --numa TYPE              attempt optimizations that help on some NUMA systems
                                    - distribute: spread execution evenly over all nodes
                                    - isolate: only spawn threads on CPUs on the node that execution started on
                                    - numactl: use the CPU map provided by numactl
                                  if run without this previously, it is recommended to drop the system page cache before using this, see https://github.com/ggerganov/llama.cpp/issues/1437
         --control-vector FILE    add a control vector
         --control-vector-scaled FILE SCALE
                                  add a control vector with user defined scaling SCALE
         --control-vector-layer-range START END
                                  layer range to apply the control vector(s) to, start and end inclusive
         --no-warmup              skip warming up the model with an empty run
         --spm-infill             use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this (default: disabled)
  -sp,   --special                special tokens output enabled (default: false)

server/completion/speculative:

         --draft-max, --draft, --draft-n N
                                  number of tokens to draft for speculative decoding (default: 16)
         --draft-min, --draft-n-min N
                                  minimum number of draft tokens to use for speculative decoding (default: 5)
         --draft-p-min P          minimum speculative decoding probability (greedy) (default: 0.9)
  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -devd, --device-draft <dev1,dev2,...>
                                  comma-separated list of devices to use for offloading the draft model (none = don't offload)
                                  use --list-devices to see a list of available devices
  -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                  number of layers to store in VRAM for the draft model
         --lookup-ngram-min N     minimum n-gram size for lookup cache (default: 0, 0 = disabled)
  -lcs,  --lookup-cache-static FILE
                                  path to static lookup cache to use for lookup decoding (not updated by generation)
  -lcd,  --lookup-cache-dynamic FILE
                                  path to dynamic lookup cache to use for lookup decoding (updated by generation)
         --pooling                pooling type for embeddings, use model default if unspecified

server/images:

         --image-max-batch N      maximum batch count (default: 4)
         --image-max-height N     image maximum height, in pixel space, must be larger than 256 (default: 1024)
         --image-max-width N      image maximum width, in pixel space, must be larger than 256 (default: 1024)
         --image-guidance N       the value of guidance during the computing phase (default: 3.500000)
         --image-strength N       strength for noising, range of [0.0, 1.0] (default: 0.750000)
         --image-sampler TYPE     sampler that will be used for generation, automatically retrieve the default value according to --model, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm
         --image-sample-steps N   number of sample steps, automatically retrieve the default value according to --model, and +10 when requesting high definition generation
         --image-cfg-scale N      for sampler, the scale of classifier-free guidance in the output phase, automatically retrieve the default value according to --model (1.0 = disabled)
         --image-schedule TYPE    denoiser sigma schedule, select from default;discrete;karras;exponential;ays;gits (default: default)
         --image-no-text-encoder-model-offload
                                  disable text-encoder(clip-l/clip-g/t5xxl) model offload
         --image-clip-l-model PATH
                                  path to the CLIP Large (clip-l) text encoder, or use --model included
         --image-clip-g-model PATH
                                  path to the CLIP Generic (clip-g) text encoder, or use --model included
         --image-t5xxl-model PATH 
                                  path to the Text-to-Text Transfer Transformer (t5xxl) text encoder, or use --model included
         --image-no-vae-model-offload
                                  disable vae(taesd) model offload
         --image-vae-model PATH   path to Variational AutoEncoder (vae), or use --model included
         --image-vae-tiling       indicate to process vae decoder in tiles to reduce memory usage (default: disabled)
         --image-taesd-model PATH 
                                  path to Tiny AutoEncoder For StableDiffusion (taesd), or use --model included
         --image-upscale-model PATH
                                  path to the upscale model, or use --model included
         --image-upscale-repeats N
                                  how many times to run upscaler (default: 1)
         --image-no-control-net-model-offload
                                  disable control-net model offload
         --image-control-net-model PATH
                                  path to the control net model, or use --model included
         --image-control-strength N
                                  how strongly to apply the control net (default: 0.900000)
         --image-control-canny    indicate to apply canny preprocessor (default: disabled)

rpc-server:

         --rpc-server-host HOST   ip address for the rpc server to listen on (default: 0.0.0.0)
         --rpc-server-port PORT   port for the rpc server to listen on (default: 0, 0 = disabled)
         --rpc-server-main-gpu N  the GPU VRAM to use for the rpc server (default: 0, -1 = disabled, use RAM)
         --rpc-server-reserve-memory MEM
                                  reserve memory in MiB (default: 0)
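
As a concrete reading of the options above, the following sketch assembles a server command line using only flags that appear in the usage text. The model path and the chosen values (context size 8192, 32 GPU layers, 4 parallel sequences) are illustrative assumptions. `run_server` and `DRY_RUN` are hypothetical names, not part of LLaMA Box. By default the function only prints the command, so it is safe to try without a model.

```shell
# Assemble a llama-box invocation from documented flags.
# Usage: run_server MODEL_PATH [PORT]
# With DRY_RUN=1 (the default here) the command is printed instead of executed.
run_server() {
    model="$1"
    port="${2:-8080}"
    # Reuse the positional parameters as an argument list so that paths
    # containing spaces survive when the command is actually executed.
    set -- llama-box \
        --host 127.0.0.1 --port "$port" \
        -m "$model" \
        -c 8192 -ngl 32 -np 4 \
        --metrics
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_server /path/to/model.gguf 8080
```

Setting `DRY_RUN=0` runs the assembled command instead of printing it, which requires `llama-box` to be on the `PATH`.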

Available environment variables (if the corresponding command-line option is not provided):

Server API

The available endpoints for the LLaMA Box server mode are:

Tools

It was hard to find a Chat UI that is directly compatible with the OpenAI API, that is, one with no installation required (I can live with docker run), no tokens (or only optional ones), and no Ollama required: just a simple RESTful API.

So, inspired by llama.cpp/chat.sh, we adapted it to interact with LLaMA Box.

All you need is a Bash shell, curl and jq.

[!NOTE] Both completion.sh and chat.sh are used for talking with LLaMA Box, but completion.sh embeds a fixed pattern to format the given prompt, while chat.sh can use the chat template from the model's metadata or a user-defined one.

$ # one-shot chat
$ MAX_TOKENS=4096 ./llama-box/tools/chat.sh "Tell me a joke"

$ # interactive chat
$ MAX_TOKENS=4096 ./llama-box/tools/chat.sh

$ # one-shot image generation
$ ./llama-box/tools/image_generate.sh "A lovely cat"

$ # interactive image generation
$ ./llama-box/tools/image_generate.sh

$ # one-shot image editing
$ IMAGE=/path/to/image.png ./llama-box/tools/image_edit.sh "A lovely cat"

$ # interactive image editing
$ IMAGE=/path/to/image.png ./llama-box/tools/image_edit.sh

$ # one-shot completion
$ N_PREDICT=4096 TOP_K=1 ./llama-box/tools/completion.sh "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include"

$ # interactive completion
$ N_PREDICT=4096 ./llama-box/tools/completion.sh
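
Since the tools need only curl and jq, a one-shot chat presumably reduces to a single HTTP request. The sketch below is not the actual chat.sh. It assumes the server listens on the default 127.0.0.1:8080 and exposes an OpenAI-style `/v1/chat/completions` path, which this document does not list, and that a `model` field may be omitted from the request. `build_chat_payload`, `chat_once` and `ENDPOINT` are hypothetical names.

```shell
# Build an OpenAI-style chat request body; jq takes care of JSON escaping.
# Usage: build_chat_payload PROMPT [MAX_TOKENS]
build_chat_payload() {
    jq -n --arg content "$1" --argjson max_tokens "${2:-4096}" \
        '{messages: [{role: "user", content: $content}], max_tokens: $max_tokens}'
}

# Send one user message and print the assistant reply.
# ENDPOINT is an assumption: the default host/port from the usage text
# combined with an OpenAI-style path.
chat_once() {
    endpoint="${ENDPOINT:-http://127.0.0.1:8080/v1/chat/completions}"
    build_chat_payload "$1" "${MAX_TOKENS:-4096}" |
        curl -sS -H "Content-Type: application/json" -d @- "$endpoint" |
        jq -r '.choices[0].message.content'
}

# Example (requires a running server):
#   MAX_TOKENS=256 chat_once "Tell me a joke"
```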

License

MIT