janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so
Apache License 2.0

epic: Cortex Hardware API #1165

Open · dan-homebrew opened this issue 1 week ago

dan-homebrew commented 1 week ago

Goal

Tasklist

Context

Cortex.cpp's Hardware API should enable us to do this in Jan

(screenshot attached)

dan-homebrew commented 1 week ago

@louis-jan I'm assigning this to you in Sprint 20, as this has a significant CLI and API design component.

EDIT: adding @nguyenhoangthuan99 for implementation

dan-homebrew commented 3 days ago

@louis-jan @nguyenhoangthuan99 I am going to move this to Sprint 21, as I think you guys should land the Model Folder and model.yaml first.

nguyenhoangthuan99 commented 1 day ago

The hardware detection serves two main purposes:

1. Selecting the appropriate engine version to install for the user's hardware.
2. Determining which models can run efficiently on that hardware.

To achieve these goals, make debugging easier, and help users choose the appropriate model, the hardware API/CLI should provide the following information:

example return body:

{
  "os": "windows",
  "arch": "amd64",
  "suitable_avx": "avx2",
  "free_memory": 8192,
  "gpu_info": [
    {
      "id": "0",
      "name": "NVIDIA GeForce RTX 3090",
      "arch": "ampere",
      "driver_version": "552.12",
      "cuda_driver_version": "12.4",
      "compute_cap": "8.6",
      "free_vram": 8192
    }
  ]
}
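
The `suitable_avx` field could be derived at runtime from CPU feature flags. A minimal sketch, assuming a GCC/Clang toolchain (`__builtin_cpu_supports` is a compiler builtin, not portable to MSVC, which would need `__cpuidex` instead):

```cpp
#include <string>

// Sketch: report the highest AVX level the running CPU supports.
std::string SuitableAvx() {
  if (__builtin_cpu_supports("avx512f")) return "avx512";
  if (__builtin_cpu_supports("avx2")) return "avx2";
  if (__builtin_cpu_supports("avx")) return "avx";
  return "none";
}
```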

Note: getting the free VRAM information from C++ is challenging and requires further investigation (the current approach is to parse the output of the nvidia-smi command). This information would allow the system to make informed decisions about which engine version to install and which models can run efficiently on the user's hardware. It also provides valuable data for debugging. cc @louis-jan for a recommendation from the Jan app side for easier integration
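
For reference, a rough sketch of that nvidia-smi parsing approach (the query flags are standard nvidia-smi options; error handling is simplified, and on Windows `popen`/`pclose` would be `_popen`/`_pclose`):

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch: query free VRAM (MiB) per GPU by parsing nvidia-smi output.
// Returns an empty vector if nvidia-smi is unavailable.
std::vector<int> GetFreeVramMib() {
  std::vector<int> result;
  FILE* pipe = popen(
      "nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits",
      "r");
  if (!pipe) return result;
  char line[128];
  while (fgets(line, sizeof(line), pipe)) {
    result.push_back(std::atoi(line));  // one value per GPU, in MiB
  }
  pclose(pipe);
  return result;
}
```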

louis-jan commented 1 day ago

From Jan, we expect to have just enough information to select a corresponding engine version / settings, such as CPU instructions / GPUs.

But we need to gather comprehensive hardware information for debugging, including CPU, GPU, RAM, OS, and connected monitors (as issues like projector connections have been known to impact performance).

Structure

To make user support easier, the hardware information should be grouped for quick lookup; a mix of flattened and grouped structures can be visually overwhelming.

E.g. the supporter has to scroll to the bottom of the file to see `os`:

```json
{
  "arch": "",
  "free_memory": "",
  "gpus": [ {}, {}, {} ],
  "os": ""
}
```

```json
{
  "device": {
    "arch": "",
    "free_memory": "",
    "os": ""
  },
  "gpus": [ {}, {}, {} ]
}
```

✅✅

```json
{
  "cpu": {
    "arch": "x64",
    "cores": "4",
    "model": "Intel Core i9 12900K",
    "instructions": [ "AVX512", "FMA", "SSE" ]
  },
  "os": {
    "version": "10.2",
    "name": "Windows 10 Pro"
  },
  "power": {
    "battery_life": 80,
    "charging_status": "charged",
    "is_power_saving": false
  },
  "ram": {
    "total": "16",
    "available": "12",
    "type": "DDR4" // better model name?
  },
  "storage": {
    "total": 512,
    "available": 256,
    "type": "SSD" // better model name?
  },
  "gpus": [ {}, {}, {} ],
  "monitors": []
}
```

Consistent from system to system

Different devices should produce the same output format, e.g. for GPU driver info. There should not be a different response body structure per GPU family.

E.g.

```json
"graphics": [
  {
    "id": "0",
    "name": "NVIDIA GeForce RTX 3090",
    "driver_version": "552.12",
    "cuda_driver_version": "12.4",
    "compute_cap": "8.6",
    "free_vram": 8192
  },
  {
    "id": "1",
    "name": "AMD Radeon RX 6800 XT",
    "driver_version": "5.0.2?",
    "cuda_driver_version": "?",
    "compute_cap": "?",
    "free_vram": 8192
  }
]
```

```json
"graphics": [
  {
    "id": "0",
    "name": "NVIDIA GeForce RTX 3090",
    "version": "12.4",
    "additional_information": {
      "driver_version": "552.12",
      "compute_cap": "8.6"
    },
    "free_vram": 8192,
    "total_vram": 8192
  },
  {
    "id": "1",
    "name": "AMD Radeon RX 6800 XT",
    "version": "6.1",
    "free_vram": 8192,
    "total_vram": 8192,
    "additional_information": {
      "rocm_git_revision": "0d0a7a10c1a3"
    }
  }
]
```

Try to gather anything that could affect performance.

Request

It would be beneficial to have filter query support, allowing clients to poll only for the data they need, e.g. `?filters=gpu,cpu`.
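
A small server-side sketch of such filtering, with hypothetical names, assuming the full report is already assembled as a section-to-JSON map:

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Sketch: given the full hardware report as section -> serialized JSON,
// keep only the sections named in a "?filters=gpu,cpu" query value.
// An empty filter string returns everything.
std::map<std::string, std::string> ApplyFilters(
    const std::map<std::string, std::string>& full_report,
    const std::string& filters) {
  if (filters.empty()) return full_report;
  std::set<std::string> wanted;
  std::stringstream ss(filters);
  std::string section;
  while (std::getline(ss, section, ',')) wanted.insert(section);
  std::map<std::string, std::string> result;
  for (const auto& [key, value] : full_report) {
    if (wanted.count(key)) result[key] = value;
  }
  return result;
}
```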

@nguyenhoangthuan99 @dan-homebrew