janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0

planning: Cortex Model Compatibility API #1108

Open imtuyethan opened 2 months ago

imtuyethan commented 2 months ago

Goal

Related Issues

Original Post

Specs

https://www.notion.so/jan-ai/Hardware-Detection-and-Recommendations-b04bc3109c2846d58572415125e0a9a5?pvs=4

Key user stories

dan-homebrew commented 2 months ago

Note: This should be driven by Cortex team, with Jan UI as one of the task items.

I think this is part of a larger "Hardware Detection, Config and Recommendations"

dan-homebrew commented 2 months ago

This is also being discussed in janhq/jan#1089 - let's link both issues. We will need to scope this to something less ambiguous

dan-homebrew commented 2 months ago

Shifting to Sprint 21 to allow the team to focus on Model Folder execution in Sprint 20

nguyenhoangthuan99 commented 2 weeks ago

To calculate the total memory buffer required for a model, let's first break it into several parts:

**Model weight**

A model weight has 3 parts:

VRAM = total_file_size - RAM (bytes)
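A minimal sketch of this split (under an assumption not stated in the thread: weights are spread roughly evenly across layers, and `ngl` of `n_layers` layers are offloaded to the GPU):

```python
def weight_split(total_file_size: int, ngl: int, n_layers: int) -> tuple[int, int]:
    """Rough (RAM, VRAM) split for model weights, in bytes.

    Assumption: weights are distributed evenly across layers, so the
    offloaded fraction is ngl / n_layers. Real GGUF layers vary in size,
    so this is an estimate only.
    """
    vram = total_file_size * ngl // n_layers
    ram = total_file_size - vram  # i.e. VRAM = total_file_size - RAM
    return ram, vram
```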


**KV cache**

The KV cache size is calculated as follows:

kv_cache_size = (ngl - 1)/33 * ctx_len/8192 * hidden_dim/4096 * quant_bit/16 * 1 (GB)

quant_bit for the KV cache has 3 modes (f16 = 16 bits, q8_0 = 8 bits, q4_0 = 4.5 bits).
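A direct transcription of this formula as a helper (the function and argument names are illustrative; the constants 33, 8192, and 4096 are the reference values baked into the formula above):

```python
# Bits per element for each supported KV cache quantization mode.
KV_QUANT_BITS = {"f16": 16.0, "q8_0": 8.0, "q4_0": 4.5}

def kv_cache_size_gb(ngl: int, ctx_len: int, hidden_dim: int,
                     kv_quant: str = "f16") -> float:
    """KV cache size in GB, per the formula above."""
    bits = KV_QUANT_BITS[kv_quant]
    return (ngl - 1) / 33 * (ctx_len / 8192) * (hidden_dim / 4096) * (bits / 16)
```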

**Buffer for preprocessing prompt**

The buffer for preprocessing prompts is related to `n_batch` and `n_ubatch`:

VRAM = min(n_batch, n_ubatch)/512 * 266 (MiB)

When not all `ngl` layers are loaded to the GPU, we also need to reserve an extra memory buffer for the output layer; in this case:

VRAM = min(n_batch, n_ubatch)/512 * 266 (MiB) + Output_layer_size



The default `n_batch` and `n_ubatch` for cortex.llamacpp are both 2048.
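The same buffer rule as a helper (again a sketch; `output_layer_size_mib` is a hypothetical parameter, left at 0 for the fully offloaded case):

```python
def prompt_buffer_mib(n_batch: int = 2048, n_ubatch: int = 2048,
                      output_layer_size_mib: float = 0.0) -> float:
    """VRAM for the prompt-processing buffer, in MiB.

    Pass output_layer_size_mib > 0 when not all `ngl` layers are
    offloaded, to reserve room for the output layer as described above.
    """
    return min(n_batch, n_ubatch) / 512 * 266 + output_layer_size_mib
```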

We also need to reserve an extra 100-200 MiB of RAM for small buffers used during processing.
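Putting the parts together, a rough total might look like the sketch below (assumptions on my part: the KV cache and prompt buffer live in VRAM alongside the offloaded weights, and the small-buffer overhead is counted against RAM at the conservative 200 MiB end):

```python
def total_memory_estimate_gib(weight_ram_bytes: int, weight_vram_bytes: int,
                              kv_cache_gb: float, prompt_buf_mib: float,
                              overhead_mib: float = 200.0) -> tuple[float, float]:
    """Rough (RAM, VRAM) totals in GiB for loading one model."""
    ram_gib = weight_ram_bytes / 1024**3 + overhead_mib / 1024
    vram_gib = weight_vram_bytes / 1024**3 + kv_cache_gb + prompt_buf_mib / 1024
    return ram_gib, vram_gib
```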

vansangpfiev commented 4 days ago

API documentation

GET /v1/models

Response

```json
{
  "data": [
    {
      "model": "model_1",
      ...
      "recommendation": {
        "cpu_mode": {
          "ram": number
        },
        "gpu_mode": [{
          "ram": number,
          "vram": number,
          "ngl": number,
          "context_length": number,
          "recommend_ngl": number
        }]
      }
    }
  ]
}
```
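A minimal sketch of consuming this endpoint (assuming a local Cortex server; the port shown is illustrative, adjust to your configuration):

```python
import requests

# Fetch the model list, including hardware recommendations per model.
resp = requests.get("http://127.0.0.1:39281/v1/models")  # port is an assumption
resp.raise_for_status()

for model in resp.json()["data"]:
    rec = model.get("recommendation", {})
    print(model["model"], "- cpu_mode RAM:", rec.get("cpu_mode", {}).get("ram"))
    for gpu in rec.get("gpu_mode", []):
        print("  gpu_mode VRAM:", gpu.get("vram"),
              "recommend_ngl:", gpu.get("recommend_ngl"))
```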
vansangpfiev commented 2 days ago

CLI Documentation:

Get model list information

cortex model list --cpu_mode --gpu_mode

If no flag is specified, only the model ID is displayed.