dan-homebrew opened this issue 2 months ago
### Current implementation

**Register Models (pre-populate model.json files)**

Any extension that registers models on load will pre-populate a `model.json` under `/models/[model-id]/model.json`:
```mermaid
sequenceDiagram
    participant ModelExtension
    participant BaseExtension
    participant FileSystem
    ModelExtension->>BaseExtension: Register Models
    BaseExtension->>BaseExtension: Pre-populate Data
    BaseExtension->>FileSystem: Write to /models
```
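For illustration, here is a minimal sketch of what this pre-population step could look like in extension code. The `ModelDescriptor` fields and the helper name are assumptions, not Jan's exact schema:

```typescript
import * as fs from "fs/promises";
import * as path from "path";

// Hypothetical model descriptor; Jan's real model.json schema differs.
interface ModelDescriptor {
  id: string;
  name: string;
  engine: string;
  settings: Record<string, unknown>;
}

// Sketch of BaseExtension's pre-populate step: write each registered model
// to /models/[model-id]/model.json on the file system.
async function prePopulateModels(
  modelsDir: string,
  models: ModelDescriptor[]
): Promise<void> {
  for (const model of models) {
    const dir = path.join(modelsDir, model.id);
    await fs.mkdir(dir, { recursive: true });
    await fs.writeFile(
      path.join(dir, "model.json"),
      JSON.stringify(model, null, 2)
    );
  }
}
```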
```mermaid
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer
    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded
```
```mermaid
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer
    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions
```
### Possible Changes

| Current ❌ | Upcoming ✅ |
|---|---|
| Run Nitro server on model load | Run cortex.cpp daemon service on start |
| Kill Nitro process pre-model-load and pre-app-exit | Keep cortex.cpp alive as a daemon process; stop on exit |
| Heavy hardware detection & prompt processing | Just send a request |
| So many requests (check port, check health, model load status) | One request to do the whole thing |
| Mixing of model management and inference (multiple responsibilities) | Single responsibility |
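To make the "one request to do the whole thing" row concrete, here is a hedged sketch of a single stateless chat request against the cortex.cpp daemon. The port, route, and response shape are assumptions based on an OpenAI-compatible API:

```typescript
// Assumes the cortex.cpp daemon is already running; no spawn, health polling,
// or explicit loadModel round-trips on the client side.
async function chat(model: string, prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:39281/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model, // the daemon resolves engine variant, hardware, and model load itself
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`chat/completions failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```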
### Current implementation

**App retrieves pre-populated models:**
```mermaid
sequenceDiagram
    App ->> ModelExtension: get available models
    ModelExtension ->> FS: read /models
    FS --> ModelExtension : ModelFile
```
**App downloads a model:**
```mermaid
sequenceDiagram
    App ->> ModelExtension: downloads
    ModelExtension ->> Networking : request
    Networking ->> FileSystem : filestream
    Networking --> ModelExtension : progress
```
**App imports a model:**
```mermaid
sequenceDiagram
    App ->> ModelExtension: imports
    ModelExtension ->> Networking : request
    ModelExtension ->> model.json : generate
    Networking ->> FileSystem : filestream
    Networking --> ModelExtension : progress
```
**App deletes a model:**
```mermaid
graph LR
    App --> |remove| Model_Extension
    Model_Extension --> |FS unlink| Files["/models/model/__files__"]
```
### Possible Changes

| Current ❌ | Upcoming ✅ |
|---|---|
| Implementation - depends on FS | Abstraction - API forwarding |
| List available models: scan through the Model Folder | GET /models |
| Delete: unlink FS | DELETE /models |
| Download: download | POST & progress /models/pulls |
| Broken model import - uses a default model.json | cortex.cpp handles the model metadata |
| Model prediction depends on model size & available RAM/VRAM only | cortex.cpp predicts based on hardware and model.yaml |
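A sketch of the "Abstraction - API forwarding" column: the model extension becomes a thin HTTP client over cortex.cpp instead of touching the file system. The routes mirror the table above; the base URL, port, and payload shapes are assumptions:

```typescript
// Hypothetical base URL for the local cortex.cpp daemon.
const CORTEX = "http://127.0.0.1:39281/v1";

// GET /models — list available models from the daemon's cache.
async function listModels(): Promise<unknown> {
  const res = await fetch(`${CORTEX}/models`);
  return res.json();
}

// DELETE /models — remove a model; no FS unlink in the extension.
async function deleteModel(id: string): Promise<void> {
  await fetch(`${CORTEX}/models/${id}`, { method: "DELETE" });
}

// POST /models/pulls — start a download; progress arrives separately.
async function pullModel(id: string): Promise<void> {
  await fetch(`${CORTEX}/models/pulls`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: id }),
  });
}
```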
### Current implementation

**App gets resource information:**
```mermaid
graph LR
    App --> |getResourcesInfo| Model_Extension
    Model_Extension --> |fetch| node-os-utils
    Model_Extension --> |getCurrentLoad| nvidia-smi
```
### Possible Changes

| Current ❌ | Upcoming ✅ |
|---|---|
| Implementation - depends on FS & CMD | Abstraction - API forwarding |
| Execute CMD | GET - hardware information endpoint |
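A sketch of what replacing the CMD and FS probing with a single forwarded request could look like; the route and response shape are assumptions:

```typescript
// Hypothetical response shape for a hardware information endpoint.
interface HardwareInfo {
  cpu?: { cores?: number; usage?: number };
  ram?: { total?: number; available?: number };
  gpus?: Array<{ name?: string; vram?: number }>;
}

// One GET replaces the node-os-utils and nvidia-smi invocations.
async function getResourcesInfo(): Promise<HardwareInfo> {
  const res = await fetch("http://127.0.0.1:39281/v1/hardware-information");
  if (!res.ok) throw new Error(`hardware information failed: ${res.status}`);
  return res.json();
}
```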
**Current ❌ → Upcoming ✅**

- `engines` (different CPU instructions and CUDA versions)
- `/models` APIs
- `/hardware-information` API
- `models.list`

There are two model discovery paths today:

- `model-extension`, which searches for a `model.json` file within the Data Folder.
- `cortex-extension`, which searches for a `model.yaml` file within the Data Folder.

Let's think about a couple of our principles.
### Minimal migration

What are some of the main concerns here?

- Should `model.json` and `model.yaml` live side by side?
- Existing `model.json` files.
- Remote models, e.g. OpenAI? `registerModels` now persists an in-memory model DTO.
- Use `models.list` to provide available models?
- Generate `model.yaml` as soon as the user tries to get or load.
- Generate `model.yaml` as soon as the user tries to import.
- `model list` can be done with just the `models.list` contents. Any further operations on a certain model can generate a `model.yaml` later.

In short, the entire migration process is just to create symlinks to downloaded models from `models.list`. No `model.yaml` or folder manipulation is involved. It should be done instantly (see the sketch below).

Migration indicator: `models.list` exists.
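A minimal sketch of that migration, assuming `models.list` is a plain newline-separated file of model paths (the file format and locations are assumptions for illustration):

```typescript
import * as fs from "fs/promises";
import * as path from "path";

// Create one symlink per downloaded model listed in models.list.
// No model.yaml generation or folder manipulation here.
async function migrateFromModelsList(dataFolder: string): Promise<void> {
  const listPath = path.join(dataFolder, "models.list");
  const entries = (await fs.readFile(listPath, "utf8"))
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);

  for (const modelPath of entries) {
    const link = path.join(dataFolder, "models", path.basename(modelPath));
    // Ignore errors for links that already exist (idempotent re-runs).
    await fs.symlink(modelPath, link).catch(() => {});
  }
}
```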
Don't pre-populate models. Remote extensions work with their own settings instead of pre-populated models. The Cortex extension registers in-memory available-to-pull models (templates).

`cortex.cpp` is a daemon process that should be started alongside the app, for example:
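A hedged sketch of that lifecycle, assuming a `cortex` binary on the PATH and the `/healthcheck` endpoint mentioned later in this thread; the subcommand and port are assumptions:

```typescript
import { spawn } from "child_process";

// Spawn the daemon detached so it stays alive across windows; the app is
// responsible for stopping it on exit.
function startCortexDaemon(): void {
  const daemon = spawn("cortex", ["start"], { detached: true, stdio: "ignore" });
  daemon.unref();
}

// Poll the health endpoint until the daemon is ready to serve requests.
async function waitUntilHealthy(timeoutMs = 30_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch("http://127.0.0.1:39281/healthcheck");
      if (res.ok) return;
    } catch {
      // daemon not listening yet; keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
  throw new Error("cortex.cpp daemon did not become healthy in time");
}
```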
**Jan migration from 0.5.x to 0.6.0**

Is it possible that `cortex-cpp` bundles multiple engines but exposes only one gateway? E.g., the client requests to load a llama.cpp model, and cortex.cpp can predict the compatible hardware and run an efficient binary. So: engine `/settings`?

**Simplify model load / chat completions request**
- `registerModels` now lists available models for download; it doesn't persist `model.json`.
- `models.list` is based on cached data, so there is no need to scan the Model Folder, which can be costly.
- `models.list`: it would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a `models.list` file. The `model.yaml` files can be generated asynchronously. (This would cover the case where the user edits `models.list` manually.)

```mermaid
sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"
    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models
```
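In extension code, the flow above could look like this sketch; both helpers are hypothetical, and the key point is that `model.yaml` generation is fire-and-forget:

```typescript
import * as fs from "fs/promises";

// Hypothetical helper: append the imported model to models.list.
async function appendToModelsList(modelPath: string): Promise<void> {
  await fs.appendFile("models.list", modelPath + "\n");
}

// Hypothetical helper: stand-in for real model.yaml generation, which would
// inspect the model file (e.g. GGUF metadata) and write a model.yaml.
async function generateModelYaml(modelPath: string): Promise<void> {
  await fs.writeFile(`${modelPath}.model.yaml`, `model: ${modelPath}\n`);
}

// Import returns as soon as models.list is updated; model.yaml generation
// runs in the background and must not block the UX.
async function importModel(modelPath: string): Promise<void> {
  await appendToModelsList(modelPath);
  generateModelYaml(modelPath).catch((err) =>
    console.warn(`async model.yaml generation failed for ${modelPath}:`, err)
  );
}
```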
This is really well thought through.

Questions @louis-jan:

- get hardware info endpoint?
- status endpoints? Or is the level of abstraction good enough at the model level?

@0xSage All great questions. Related issue where we discussed the Hardware Info endpoint: https://github.com/janhq/cortex.cpp/issues/1165#issuecomment-2355343899

- Jan does not let users change hardware-level configuration, only engine settings (CPU/GPU mode, CPU threads, ...).
- Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case from Jan's work, such as implementing a switch or update mechanism for the engine.
- /healthcheck is needed, and is implemented in cortex.cpp.

The one thing blocking this is download progress sync, which we aligned on a socket approach (sketched below).
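For illustration, a sketch of that socket approach on the client side using the `ws` package; the endpoint path, event name, and payload shape are assumptions:

```typescript
import WebSocket from "ws";

// Hypothetical progress payload pushed by the daemon during a model pull.
interface DownloadProgressEvent {
  modelId: string;
  transferred: number;
  total: number;
}

// Subscribe once and fan progress events out to the UI.
function watchDownloadProgress(
  onProgress: (e: DownloadProgressEvent) => void
): WebSocket {
  const ws = new WebSocket("ws://127.0.0.1:39281/events");
  ws.on("message", (raw) => {
    const event = JSON.parse(raw.toString());
    if (event.type === "DownloadUpdated") onProgress(event.payload);
  });
  return ws;
}
```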
Nice, always great when the best answer is the simplest!! 🙏
- `cortex engines install --sources [path_to_binaries]`, or via API; cortex will detect hardware and install accordingly (from sources).
- `cortex engines get` to see the installed variant -> update the UI accordingly, e.g. GPU On/Off.

**Pros**

**Cons**

Since we no longer use `models.list`, imports are now simply based on the `engine` name, e.g. `nitro` models. This operation runs asynchronously and won't affect the UX, since it works with cached data. Even if a model is not imported yet, it should still function normally (with the stateless model load endpoint). This does leave a door open for broken requests or attempts.

If users are updating from a very old version, we run a scan over the `model.json` files and persist the cache (as the current legacy logic does) -> continue with (1).
Current issues being faced:
@dan-homebrew will create an Implementation Issue. @louis-jan, can you link the implementation-related issues here:
**Goal**

(`nitro-extension`, and `cortex-extension`?)

**Tasklist**