Goal

Jan should be able to seamlessly move from Nitro to cortex.cpp
What is the scope of change?
- Different inference extensions? (e.g. nitro-extension, and cortex-extension?)
- Data Structures (old legacy folders, vs. new?)
- Separation of concerns (e.g. Jan used to be in charge of model downloads, now calls cortex.cpp instead?)
What is our strategy?
- Parallel: support both legacy and new
- Migration: move from old Nitro to new cortex.cpp?

Tasklist

[ ] Clearly articulate the architectural change that needs to happen
[ ] Clearly articulate the scope of changes we need to account for
[ ] Figure out our migration strategy

Scope of changes

Nitro Inference Extension
Model Extension
Monitoring Extension
Nitro inference extension

Current implementation

Register Models (pre-populate model.json files) Any extensions register models on load will pre-populate model.json under /models/[model-id]/model.json

sequenceDiagram
participant ModelExtension
participant BaseExtension
participant FileSystem

ModelExtension->>BaseExtension: Register Models
BaseExtension->>BaseExtension: Pre-populate Data
BaseExtension->>FileSystem: Write to /models

Load Model:
- Set additional .dll/.so PATH (for engine loading)
- Hardware Information (to decide engine binary)
- Run nitro server
- Parse prompt template
- Load a GGUF model with its file path and model settings (passed from App)

sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded

Inference (inheritance - OAIEngine.ts) Any extensions inheriting from the Base OAI Engine class will forward requests to their respective inference endpoints.

sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions

Possible Changes

Current ❌	Upcoming ✅
Run Nitro server on model load	Run cortex.cpp daemon service on start
Kill nitro process on pre-model-load and pre-app-exit	Keep cortex-cpp alive, daemon process, stop on exit
Heavy hardware detection & prompt processing	Just send a request
So many requests (check port, check health, model load status)	One request to do the whole thing
Mixing of model management and inference - Multiple responsibilities	Single responsibility

Model extension

Current implementation

Download Model (ModelFile as payload)
Delete Model (ModelFile as payload)
Get Models (Scan through models folder and return ModelFile[])
Import Model (Generate ModelFile and download)
Fetch HF Repo Data (for HF model import selection)

App retrieves pre-populated models:

sequenceDiagram

App ->> ModelExtension: get available models
ModelExtension ->> FS: read /models
FS --> ModelExtension : ModelFile

App downloads a model:


sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App imports a model


sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
ModelExtension ->> model.json :generate
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App deletes a model

graph LR

App --> |remove| Model_Extension
Model_Extension --> |FS unlink| /models/model/__files__

Possible Changes

Current ❌	Upcoming ✅
Implementation - Depends on FS	Abstraction - API Forwarding
List Available Models: Scan through Model Folder	GET /models
Delete: Unlink FS	DELETE /models
Download: Download	POST & Progress /models/pulls
Broken Model Import - Using default model.json	cortex.cpp handles the model metadata
Model prediction depends on model size & available RAM/VRAM only	cortex.cpp predicts base on hardware and model.yaml

System Monitoring extension

Current implementation

Get GPU Settings
Get System Information

App get resources information

graph LR

App --> |getResourcesInfo| Model_Extension
Model_Extension --> |fetch| node-os-utils
Model_Extension --> |getCurrentLoad| nvidia-smi

Possible Changes

Current ❌	Upcoming ✅
Implementation - Depends on FS & CMD	Abstraction - API Forwarding
Execute CMD	GET - Hardware Information Endpoint

Overview

Current ❌

Upcoming ✅

Assumption

cortex.cpp bundles multiple engines (different CPU instructions and CUDA versions)
cortex.cpp support /models APIs
- GET: /models (available, active status, compatibility prediction)
- POST: /models/pull (& progress?)
- DELETE: /models
cortex.cpp support /hardware-information API

Challenges of moving Nitro to cortex.cpp

Different Data (Folder & File) structures
Backward / Forward compatibility

The migration

How to seamlessly move from Nitro to cortex.cpp, where:
- cortex.cpp works with new Data Folder structure
- cortex.cpp works with model.yaml
- cortex.cpp works with models.list
How to maintain the data folder when users switch back to older versions?
- Older versions rely on model-extension, which searches for a model.json file within the Data Folder.
- Newer versions rely on cortex-extension, which searches for a model.yaml file within the Data Folder.

Let's think about a couple of our principles.

We don't remove or manipulate user data.
Rollback should always work.
Minimal migration

What are some of the main concerns here?
Can we use model.json and model.yaml side by side?
1. We should. Since the model folder can contain anything, from README.md, .gitignore, GGUF, model.yaml to model.json.
2. Older versions will still function with legacy model.json files.
3. Newer versions will work with the latest model.yaml files.
How to sync between those two?
1. It's hard to sync between those two, since different structures could break the app.
2. We just try to migrate once when there is no models.list available. This is a good flag for migration triggering.
3. After migrating, each app version works independently with its own model file format.
How about model pre-population? In other words, Model Hub.
1. Model pre-population is an anti-pattern. Pre-populated models do not work with versioning or create unwanted data that confuse users. How about our Model Hub list thousands of models?
2. We implemented model import, which replaces the need for a model file. Users can just import with the HF repo ID. Users do not have any reason to duplicate or edit a pre-populated model.json.
3. Model listing can be done from the extension.
4. In short, in the next version, we don't pre-populate unwanted files to the Data Folder. Only when users decide to download.
5. Users deleting a model means deleting the persisted model.yaml & model files.
How do other extensions work with their models? E.g., OpenAI
1. Remote models can be populated during the build, not persisted. registerModels now persists in-memory model DTO.
2. We don't pre-populate remote models, which is not necessary. Users are better setting them from Extension Settings. It's more or less an Extension configuration, not Model population.
Migration complexity and UX
1. We don't convert model.json to model.yaml. Instead, import with symlink. It could be faster and avoid new logic added from Jan, which is redundant. Lightweight migration with less risk. Maintain the Model ID is key; otherwise, all threads break.
2. We don't move any files, which could drag the migration process long. E.g., GGUF
3. How about new/manual adding GGUFs? The model symlink feature is always there for that.
4. There are bad migration experiences in the past that we can avoid such as:
  1. Migrate all pre-populated models
  2. Heavy file movement drags the duration long
  3. Migrate everything at once
5. Now we just migrate downloaded models:
  1. Import downloaded models only as symlinks (no file movement)
  2. Don't update the ID, which will kill us on data inconsistency
  3. Another thought: Do we really need to wait for model.yaml creation during migration?
    1. cortex.cpp can work with the models.list to provide available models?
    2. model.yaml generation is an asynchronous operation so:
      1. It generates model.yaml as soon as user try to get or load.
      2. It generates model.yaml as soon as user try to import.
      3. Don't block the client GUI; model list can be done with just the models.list contents. Any further operations with a certain model can generate a model.yaml later.
      4. Client will prioritize the active thread's model then others to not blocking users working threads.
      5. If something goes wrong, the GGUF file will still be there and can be generated later on other operations. The model.yaml file is not strictly required to be available, but just the cache of model file metadata?
Better cache mechanism
1. Model list and detail have worked with the File System before, and now they're sending an API request to cortex.cpp.
2. To prevent slow loading, the client should cache accordingly on the frontend.
  Summary
  
  In short, the entire migration process is just to create symlinks to downloaded models from models.list. No model.yaml or folder manipulation involved. It should be done instantly?

Migrate indicator: models.list exist.

Don't pre-populate models. Remote Extensions work with their own settings instead of pre-populated models. Cortex Extension registers in-memory available to pull models (templates).

cortex.cpp is a daemon process that should be started alongside the app.

Jan migration from 0.5.x to 0.6.0

Bundled Engines

Is it possible that, cortex-cpp bundles multiple engines, but expose only 1 gateway?

E.g. The client requests to load a llama.cpp model, but cortex.cpp can predict the hardware compatible and run an efficient binary.

So:

Clients do not need to send any extra engine parameters or minimal (type).
Clients don't need to parse prompt templates, that's something the model should handle.
cortex.cpp owns the model metadata, allowing it to operate independently.
cortex.cpp masks up the complex binary distribution, exposing a simple interface.
GPU ON/OFF - GPU Selections can be done via engine /settings?

Eventually, that's all it needs to work with – the Model ID (aka model name).

Simplify `model load / chat completions` request

Incremental Path

We do what's not related to cortex.cpp first - Remote Extensions & Pre-populated Models
1. Rather than pre-populate, enhance the model configurations.
2. registerModels now lists available models for download, don't persist model.json.
Better data caching
1. Data retrieved from extensions should be cached on the frontend for subsequent loads.
2. Reduce direct API requests and perform more data synchronization operations.
3. Implementing a good cache layer would save a bad user experience during migration later, where the app doesn't need to scan through the models list, but can just dump cached data and imports right away. It won't interrupt users' working threads since asynchronous operations take care of data persistence (model.yaml), and model load requests are typically long-delayed responses.
Minimal Migration Steps (cortex-cpp ready)
1. Generate models.list based on cached data, do not need to scan the Model Folder, which can be costly.
2. Send model import or symlink requests to generate models.list. It would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a models.list file. The model.yaml files can be generated asynchronously. (This would cover the case user edits the models.list manually)
3. Update extensions to redirect requests.
4. The worst-case scenario is when users update from significantly older versions that lack cache improvements. Go through model folders and send import requests. During app update.

sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"

    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models

This is really well thought through.

Questions @louis-jan :

What are the specific attributes Jan needs from the get hardware info endpoint?
- OS info
- CPU info
- RAM size (total, utilized)
- GPU SKU
- VRAM size (total, utilized)
- What else? (do you need storage info, or additional unforeseen stats?)
Do you need hardware configuration endpoints? i.e. does Jan ever need to let users change some hardware level configuration.
Dumb question, but do you need engine status endpoints? Or is the level of abstraction good enough at the model level
What are the Cortex sub-process endpoints needed? /keepalive /healthcheck?

@0xSage All great questions, related issue that we discussed Hardware Info endpoint. https://github.com/janhq/cortex.cpp/issues/1165#issuecomment-2355343899

Jan does not let users change hardware level configuration but Engine settings (CPU/GPU mode, CPU threads ...)
Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case from Jan's work, such as implementing a switch or update mechanism for the engine.
/healthcheck is needed, and implemented from cortex.cpp

The one blocking this is download progress sync, which we aligned with a socket approach.

Nice, always great when the best answer is the simplest!! 🙏

On Mon, Sep 30, 2024 at 10:35 AM Louis @.***> wrote:

@0xSage https://github.com/0xSage All great questions, related issue that we discussed Hardware Info endpoint. janhq/cortex.cpp#1165 (comment) https://github.com/janhq/cortex.cpp/issues/1165#issuecomment-2355343899

Jan does not let users change hardware level configuration but Engine settings (CPU/GPU mode, CPU threads ...)

Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case from Jan's work, such as implementing a switch or update mechanism for the engine.

/healthcheck is needed, and implemented from cortex.cpp

The one blocking this is download progress sync, which we aligned with a socket approach.

— Reply to this email directly, view it on GitHub https://github.com/janhq/jan/issues/3690#issuecomment-2381888887, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQVWFCD6FUKNF7B3EWM42Q3ZZC2IBAVCNFSM6AAAAABOLJYFMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOBRHA4DQOBYG4 . You are receiving this because you were mentioned.Message ID: @.***>

Package cortex.cpp into Jan app:

The app will bundle all available cortex.cpp binaries, just like the cortex.cpp installer.
During update, it executes cortex engines install --sources [path_to_binaries] or via API, cortex will detect hardware and install accordingly (from sources).
App Settings
- Use cortex engines get to see the installed variant -> Update UI accordingly, e.g. GPU On/Off.
- cortex.cpp will include a flag to select the variant, letting users choose their GPU from the app settings. Then, it will install the appropriate cortex engine version accordingly. Since all binaries are included, it's simply about switching between variants. (https://github.com/janhq/cortex.cpp/issues/1390)
- Investigating on Multiple GPUs support... TBD

Pros

No internet connection is required.
The same user experience.

Cons

The package has become a bit larger because it now includes the Cuda DLLs. However, no additional downloads are needed.

cortex.cpp configurations on spawn

Config path
Host & Port
Log paths
Data folder path
HF Token (via Env?)
Proxy Configs (via Env?) - KIV for now

Seamless Models Imports & Run

Since we no longer use models.list, it's now simply based on the engine name for imports.

App 0.6.x opens (given that users have updated to 0.5.5)
Trigger get models on load as usual (extension - asynchronously - background).
Go through the downloaded models (from the cache) — very fast since only previously downloaded models are involved.
Send model import requests for nitro models.
Persist the cache.

This operation runs asynchronously, won't affect the UX, since it works with cached data. Even the model is not imported yet, it should still function normally (with stateless model load endpoint). There will be a door for broken requests or attempts.

In case users are updating from a very old version, we run a scan on model.json and persist the cache (as the current - legacy logics) -> Continue with (1)

Current issues being faced:

Jan/Cortex Data Folder issues (backward compatible)

@dan-homebrew: will create an Implementation Issue: @louis-jan Can you link the implementation-related issues here:

janhq / jan

planning: Jan's path to cortex.cpp? #3690

Goal

Tasklist

Scope of changes

Nitro inference extension

Model extension

System Monitoring extension

Overview

Assumption

Challenges of moving Nitro to cortex.cpp

The migration

Summary

Bundled Engines

Eventually, that's all it needs to work with – the Model ID (aka model name).

Incremental Path

Package cortex.cpp into Jan app:

cortex.cpp configurations on spawn

Seamless Models Imports & Run