Note: This should be driven by the Cortex team, with Jan UI as one of the task items.
I think this is part of a larger "Hardware Detection, Config and Recommendations" effort.
This is also being discussed in janhq/jan#1089 - let's link both issues. We will need to scope this to something less ambiguous.
`GET /models`, based on activated/detected hardware.
Shifting to Sprint 21 to allow the team to focus on Model Folder execution in Sprint 20.
To calculate the total memory required for a model, let's first break it into several parts. The estimate depends on two settings: `ngl` (the number of GPU layers setting) and `ctx_len` (the context length used for processing).

**Model weight**

A model weight has 3 parts:
- Token embeddings, shape (n_vocab, embedding_length): this part is always allocated in CPU RAM and is calculated as `n_vocab * embedding_length * 2 * quant_bit / 16` bytes. The quant_bit depends on the quantization level of the model (e.g. for Q4_K_M, the quant_bit for the token embeddings is Q4_K = 4.5 bits).
- Repeated transformer layers: this is the part the `ngl` setting controls; adjusting `ngl` moves these layers between CPU RAM and GPU VRAM so the model can fit within the available VRAM.
- Output layer: this part is treated as 1 layer in the `ngl` setting. For example, if the total `ngl` of the model is 33 and we set `ngl=32`, the output layer is loaded into CPU RAM and the remaining repeated transformer layers are loaded onto the GPU. The output layer size is calculated as `n_vocab * embedding_length * 2 * quant_bit / 16` bytes. The quant_bit for the output layer is usually higher than the model's overall quantization level, and we can't estimate the exact quantization level for every model (e.g. model quantization Q4_K_M -> output layer Q6_K).
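For a sense of scale, take a hypothetical model with `n_vocab = 32000` and `embedding_length = 4096`, quantized as Q4_K_M: the token embeddings come to `32000 * 4096 * 2 * 4.5 / 16 ≈ 73.7 MB` of CPU RAM, and an output layer stored at roughly 6.5 bits (Q6_K) comes to about 106 MB. These numbers are purely illustrative; the real sizes come from the model's GGUF metadata.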
In summary, the equations for the model weight are:

`RAM = token_embeddings_size + ((total_ngl - ngl) >= 1 ? Output_layer_size + (total_ngl - ngl - 1) / (total_ngl - 1) * (total_file_size - token_embeddings_size - Output_layer_size) : 0)` (bytes)

`VRAM = total_file_size - RAM` (bytes)
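A minimal Python sketch of that split, assuming the caller reads `total_file_size`, the per-part sizes, and `total_ngl` from the GGUF metadata; the function and variable names are illustrative, not an existing Cortex API:

```python
def split_model_weight(total_file_size: int,
                       token_embeddings_size: int,
                       output_layer_size: int,
                       total_ngl: int,
                       ngl: int) -> tuple[int, int]:
    """Estimate how many bytes of the model weights land in CPU RAM vs GPU VRAM."""
    ram = float(token_embeddings_size)  # token embeddings always stay in CPU RAM
    layers_on_cpu = total_ngl - ngl
    if layers_on_cpu >= 1:
        # The output layer counts as one of the CPU-side layers; the remaining CPU
        # layers take a proportional share of the repeated transformer layer weights.
        repeated_layers_size = total_file_size - token_embeddings_size - output_layer_size
        ram += output_layer_size + (layers_on_cpu - 1) / (total_ngl - 1) * repeated_layers_size
    vram = total_file_size - ram
    return int(ram), int(vram)
```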
**KV cache**
The KV cache size is calculated as follows:

`kv_cache_size = (ngl - 1)/33 * ctx_len/8192 * hidden_dim/4096 * quant_bit/16 * 1 GB`

The quant_bit for the KV cache has 3 modes: f16 = 16 bits, q8_0 = 8 bits, q4_0 = 4.5 bits.
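The same estimate as a small Python helper; the reference constants (33 layers, 8192 context, 4096 hidden dim, 16-bit cache) come straight from the formula above, and the 1 GB baseline is interpreted here as 1 GiB:

```python
GIB = 1024 ** 3  # interpreting the 1 GB baseline as 1 GiB

def kv_cache_size_bytes(ngl: int, ctx_len: int, hidden_dim: int, quant_bit: float = 16.0) -> int:
    """KV cache estimate: (ngl-1)/33 * ctx_len/8192 * hidden_dim/4096 * quant_bit/16 * 1 GB.

    quant_bit: 16 for f16, 8 for q8_0, 4.5 for q4_0.
    """
    return int((ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16 * GIB)
```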
**Buffer for preprocessing prompt**
The buffer for preprocessing prompts depends on `n_batch` and `n_ubatch`:

`VRAM = min(n_batch, n_ubatch) / 512 * 266` (MiB)

When not all `ngl` layers are loaded onto the GPU, an extra buffer must be reserved for the output layer; in that case:

`VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB) + Output_layer_size`
The default `n_batch` and `n_ubatch` for cortex.llamacpp are both 2048.
We also need to reserve an extra 100-200 MiB of RAM for small buffers used during processing.
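Putting the last pieces together, a hedged sketch of the prompt-processing buffer and the extra scratch reservation; the helper names are illustrative, and the scratch overhead is taken at the upper 200 MiB bound to stay conservative:

```python
MIB = 1024 ** 2

def prompt_buffer_vram_bytes(n_batch: int = 2048,
                             n_ubatch: int = 2048,
                             output_layer_size: int = 0,
                             all_layers_on_gpu: bool = True) -> int:
    """Prompt-processing buffer: min(n_batch, n_ubatch) / 512 * 266 MiB,
    plus the output-layer size when not every ngl layer is offloaded."""
    vram = min(n_batch, n_ubatch) / 512 * 266 * MIB
    if not all_layers_on_gpu:
        vram += output_layer_size
    return int(vram)

def estimate_total_ram_bytes(weight_ram: int, scratch_mib: int = 200) -> int:
    """Total CPU RAM estimate: model-weight RAM plus 100-200 MiB of small buffers."""
    return weight_ram + scratch_mib * MIB
```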
GET /v1/models
Response
{
  "data": [
    {
      "model": "model_1",
      ...
      "recommendation": {
        "cpu_mode": {
          "ram": number
        },
        "gpu_mode": [
          {
            "ram": number,
            "vram": number,
            "ngl": number,
            "context_length": number,
            "recommend_ngl": number
          }
        ]
      }
    }
  ]
}
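A hypothetical client-side check against the proposed response shape. The base URL and port are assumptions about a locally running Cortex server, and the field names simply mirror the draft above rather than a shipped API:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:39281"  # assumption: adjust to wherever the Cortex server listens

def fits_in_vram(model_id: str, free_vram_bytes: int) -> bool:
    """Return True if any gpu_mode recommendation for `model_id` fits in the free VRAM."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
        models = json.load(resp)["data"]
    model = next(m for m in models if m["model"] == model_id)
    return any(mode["vram"] <= free_vram_bytes
               for mode in model["recommendation"]["gpu_mode"])
```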
Get model list information
cortex model list --cpu_mode --gpu_mode
If no flag is specified, only the model ID is displayed.
Goal

- `model.yaml`
- `GET /models` and `GET /model/<model_id>`

Related Issues
Original Post
Specs
https://www.notion.so/jan-ai/Hardware-Detection-and-Recommendations-b04bc3109c2846d58572415125e0a9a5?pvs=4
Key user stories
Design
https://www.figma.com/design/DYfpMhf8qiSReKvYooBgDV/Jan-App-(3rd-version)?node-id=5115-60038&t=OgzCw09qXKxZj3DC-4