Note: This should be driven by the Cortex team, with Jan UI as one of the task items.
I think this is part of a larger "Hardware Detection, Config and Recommendations" effort.
This is also being discussed in janhq/jan#1089 - let's link both issues. We will need to scope this to something less ambiguous.
`GET /models`, based on activated/detected hardware.
Shifting to Sprint 21 to allow the team to focus on Model Folder execution in Sprint 20.
To calculate the total memory required for a model, let's first break it into several parts. The estimate depends on two settings: `ngl` (the number of GPU layers setting) and `ctx_len` (the context length used for processing).

**Model weight**

A model weight has 3 parts:
- Token embeddings, shape (n_vocab, embedding_length): this part is always allocated in CPU RAM and is calculated as `n_vocab * embedding_length * 2 * quant_bit / 16` bytes. The quant_bit depends on the quantization level of the model (e.g. for Q4_K_M, the quant_bit for the token embeddings is Q4_K = 4.5 bits).
- Repeated transformer layers: this is the part the `ngl` setting controls; adjusting `ngl` moves these layers between CPU RAM and GPU VRAM so the model can fit within the available VRAM.
- Output layer: this part is treated as 1 layer in the `ngl` setting. For example, if the total `ngl` of the model is 33 and we set `ngl=32`, the output layer is loaded into CPU RAM and the remaining repeated transformer layers are loaded onto the GPU. The output layer size is calculated as `n_vocab * embedding_length * 2 * quant_bit / 16` bytes. The quant_bit for the output layer is usually higher than the model's overall quantization level, and we can't estimate the exact quantization level for every model (e.g. model quantization Q4_K_M -> output layer Q6_K).
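For a sense of scale, take a hypothetical model with `n_vocab = 32000` and `embedding_length = 4096`, quantized as Q4_K_M: the token embeddings come to `32000 * 4096 * 2 * 4.5 / 16 ≈ 73.7 MB` of CPU RAM, and an output layer stored at roughly 6.5 bits (Q6_K) comes to about 106 MB. These numbers are purely illustrative; the real sizes come from the model's GGUF metadata.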
In summary, the equations for the model weight are:

`RAM = token_embeddings_size + ((total_ngl - ngl) >= 1 ? Output_layer_size + (total_ngl - ngl - 1) / (total_ngl - 1) * (total_file_size - token_embeddings_size - Output_layer_size) : 0)` (bytes)

`VRAM = total_file_size - RAM` (bytes)
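A minimal Python sketch of that split, assuming the caller reads `total_file_size`, the per-part sizes, and `total_ngl` from the GGUF metadata; the function and variable names are illustrative, not an existing Cortex API:

```python
def split_model_weight(total_file_size: int,
                       token_embeddings_size: int,
                       output_layer_size: int,
                       total_ngl: int,
                       ngl: int) -> tuple[int, int]:
    """Estimate how many bytes of the model weights land in CPU RAM vs GPU VRAM."""
    ram = float(token_embeddings_size)  # token embeddings always stay in CPU RAM
    layers_on_cpu = total_ngl - ngl
    if layers_on_cpu >= 1:
        # The output layer counts as one of the CPU-side layers; the remaining CPU
        # layers take a proportional share of the repeated transformer layer weights.
        repeated_layers_size = total_file_size - token_embeddings_size - output_layer_size
        ram += output_layer_size + (layers_on_cpu - 1) / (total_ngl - 1) * repeated_layers_size
    vram = total_file_size - ram
    return int(ram), int(vram)
```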
**KV cache**
The KV cache size is calculated as follows:

`kv_cache_size = (ngl - 1)/33 * ctx_len/8192 * hidden_dim/4096 * quant_bit/16 * 1 GB`

The quant_bit for the KV cache has 3 modes: f16 = 16 bits, q8_0 = 8 bits, q4_0 = 4.5 bits.
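The same estimate as a small Python helper; the reference constants (33 layers, 8192 context, 4096 hidden dim, 16-bit cache) come straight from the formula above, and the 1 GB baseline is interpreted here as 1 GiB:

```python
GIB = 1024 ** 3  # interpreting the 1 GB baseline as 1 GiB

def kv_cache_size_bytes(ngl: int, ctx_len: int, hidden_dim: int, quant_bit: float = 16.0) -> int:
    """KV cache estimate: (ngl-1)/33 * ctx_len/8192 * hidden_dim/4096 * quant_bit/16 * 1 GB.

    quant_bit: 16 for f16, 8 for q8_0, 4.5 for q4_0.
    """
    return int((ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16 * GIB)
```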
**Buffer for preprocessing prompt**
The buffer for preprocessing prompts depends on `n_batch` and `n_ubatch`:

`VRAM = min(n_batch, n_ubatch) / 512 * 266` (MiB)

When not all `ngl` layers are loaded onto the GPU, an extra buffer must be reserved for the output layer; in that case:

`VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB) + Output_layer_size`
The default `n_batch` and `n_ubatch` for cortex.llamacpp are both 2048.
We also need to reserve an extra 100-200 MiB of RAM for small buffers used during processing.
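Putting the last pieces together, a hedged sketch of the prompt-processing buffer and the extra scratch reservation; the helper names are illustrative, and the scratch overhead is taken at the upper 200 MiB bound to stay conservative:

```python
MIB = 1024 ** 2

def prompt_buffer_vram_bytes(n_batch: int = 2048,
                             n_ubatch: int = 2048,
                             output_layer_size: int = 0,
                             all_layers_on_gpu: bool = True) -> int:
    """Prompt-processing buffer: min(n_batch, n_ubatch) / 512 * 266 MiB,
    plus the output-layer size when not every ngl layer is offloaded."""
    vram = min(n_batch, n_ubatch) / 512 * 266 * MIB
    if not all_layers_on_gpu:
        vram += output_layer_size
    return int(vram)

def estimate_total_ram_bytes(weight_ram: int, scratch_mib: int = 200) -> int:
    """Total CPU RAM estimate: model-weight RAM plus 100-200 MiB of small buffers."""
    return weight_ram + scratch_mib * MIB
```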
GET /v1/models
Response
{
  "data": [
    {
      "model": "model_1",
      ...
      "recommendation": {
        "cpu_mode": {
          "ram": number
        },
        "gpu_mode": [
          {
            "ram": number,
            "vram": number,
            "ngl": number,
            "context_length": number,
            "recommend_ngl": number
          }
        ]
      }
    }
  ]
}
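A hypothetical client-side check against the proposed response shape. The base URL and port are assumptions about a locally running Cortex server, and the field names simply mirror the draft above rather than a shipped API:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:39281"  # assumption: adjust to wherever the Cortex server listens

def fits_in_vram(model_id: str, free_vram_bytes: int) -> bool:
    """Return True if any gpu_mode recommendation for `model_id` fits in the free VRAM."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
        models = json.load(resp)["data"]
    model = next(m for m in models if m["model"] == model_id)
    return any(mode["vram"] <= free_vram_bytes
               for mode in model["recommendation"]["gpu_mode"])
```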
Get model list information
cortex model list --cpu_mode --gpu_mode
If no flag is specified, only the model ID is displayed.
Goal

- `model.yaml`
- `GET /models` and `GET /model/<model_id>`

Related Issues
Original Post
Specs
https://www.notion.so/jan-ai/Hardware-Detection-and-Recommendations-b04bc3109c2846d58572415125e0a9a5?pvs=4
Key user stories
Design
https://www.figma.com/design/DYfpMhf8qiSReKvYooBgDV/Jan-App-(3rd-version)?node-id=5115-60038&t=OgzCw09qXKxZj3DC-4