janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer. Multiple engine support (llama.cpp, TensorRT-LLM)
https://jan.ai/
GNU Affero General Public License v3.0

planning: Jan's path to cortex.cpp? #3690

Open dan-homebrew opened 2 months ago

dan-homebrew commented 2 months ago

Goal

Tasklist

louis-jan commented 2 months ago

Scope of changes

```mermaid
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded
```

```mermaid
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions
```

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Run Nitro server on model load | Run cortex.cpp daemon service on start |
| Kill Nitro process before model load and before app exit | Keep cortex.cpp alive as a daemon process; stop on exit |
| Heavy hardware detection & prompt processing | Just send a request |
| Many requests (check port, check health, model load status) | One request to do the whole thing |
| Mixing model management and inference (multiple responsibilities) | Single responsibility |
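For a feel of the difference, here is a minimal sketch of the "upcoming" load path, assuming a cortex.cpp daemon is already running. The port and endpoint path are illustrative assumptions, not the confirmed cortex.cpp API:

```typescript
// Sketch only: endpoint path and port are illustrative assumptions.
const CORTEX_URL = "http://127.0.0.1:3928"; // hypothetical daemon address

// One request to load a model: no process spawning, no hardware probing,
// no health polling. The daemon owns all of that.
async function loadModel(modelId: string): Promise<void> {
  const res = await fetch(`${CORTEX_URL}/v1/models/start`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: modelId }),
  });
  if (!res.ok) throw new Error(`Model load failed: ${res.status}`);
}
```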

Model extension

Current implementation

App retrieves pre-populated models:

```mermaid
sequenceDiagram

App ->> ModelExtension: get available models
ModelExtension ->> FS: read /models
FS -->> ModelExtension: ModelFile
```

App downloads a model:


```mermaid
sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking: request
Networking ->> FileSystem: filestream
Networking -->> ModelExtension: progress
```

App imports a model


```mermaid
sequenceDiagram

App ->> ModelExtension: imports
ModelExtension ->> Networking: request
ModelExtension ->> model.json: generate
Networking ->> FileSystem: filestream
Networking -->> ModelExtension: progress
```

App deletes a model

```mermaid
graph LR

App -->|remove| Model_Extension
Model_Extension -->|FS unlink| ModelFiles["/models/model/__files__"]
```

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation depends on FS | Abstraction - API forwarding |
| List available models: scan through the Model Folder | GET /models |
| Delete: unlink from FS | DELETE /models |
| Download: download & track progress | POST /models/pulls |
| Broken model import - uses a default model.json | cortex.cpp handles the model metadata |
| Model prediction depends only on model size & available RAM/VRAM | cortex.cpp predicts based on hardware and model.yaml |
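Sketched as code, the model extension becomes a thin HTTP client. The routes below mirror the table above and are assumptions about the final API shape:

```typescript
// Sketch: the extension forwards requests instead of scanning the FS.
// Exact paths and payload shapes are assumptions.
const CORTEX_URL = "http://127.0.0.1:3928";

const listModels = () =>
  fetch(`${CORTEX_URL}/models`).then((r) => r.json());

const deleteModel = (id: string) =>
  fetch(`${CORTEX_URL}/models/${encodeURIComponent(id)}`, { method: "DELETE" });

const pullModel = (id: string) =>
  fetch(`${CORTEX_URL}/models/pulls`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: id }),
  });
```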

System Monitoring extension

Current implementation

App gets resource information:

```mermaid
graph LR

App -->|getResourcesInfo| Monitoring_Extension
Monitoring_Extension -->|fetch| OsUtils["node-os-utils"]
Monitoring_Extension -->|getCurrentLoad| NvidiaSmi["nvidia-smi"]
```

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation depends on FS & CMD | Abstraction - API forwarding |
| Execute CMD | GET - hardware information endpoint |
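The monitoring extension could then shrink to a single request. The endpoint name below is an assumption:

```typescript
// Sketch: replace node-os-utils / nvidia-smi calls with one request.
// The /v1/hardware path is an illustrative assumption.
async function getResourcesInfo() {
  const res = await fetch("http://127.0.0.1:3928/v1/hardware");
  return res.json(); // e.g. CPU, RAM, and GPU stats as reported by cortex.cpp
}
```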

Overview

Current ❌
Image
Upcoming ✅
Image

Assumption

Challenges of moving Nitro to cortex.cpp

The migration

Let's think through a few of our principles.

  1. We don't remove or manipulate user data.
  2. Rollback should always work.
  3. Minimal migration

    What are some of the main concerns here?

  4. Can we use model.json and model.yaml side by side?
    1. We should, since the model folder can contain anything: README.md, .gitignore, GGUF files, model.yaml, or model.json.
    2. Older versions will still function with legacy model.json files.
    3. Newer versions will work with the latest model.yaml files.
  5. How to sync between those two?
    1. Syncing the two is hard, since diverging structures could break the app.
    2. We just migrate once, when no models.list is available; its absence is a good flag to trigger the migration.
    3. After migrating, each app version works independently with its own model file format.
  6. How about model pre-population? In other words, Model Hub.
    1. Model pre-population is an anti-pattern. Pre-populated models don't work with versioning and create unwanted data that confuses users. And what happens when our Model Hub lists thousands of models?
    2. We implemented model import, which replaces the need for a model file. Users can just import with the HF repo ID, so they have no reason to duplicate or edit a pre-populated model.json.
    3. Model listing can be done from the extension.
    4. In short, in the next version we don't pre-populate unwanted files in the Data Folder - files are written only when users decide to download.
    5. When users delete a model, we delete the persisted model.yaml and the model files.
  7. How do other extensions work with their models? E.g., OpenAI
    1. Remote models can be populated during the build, not persisted. registerModels now registers in-memory model DTOs.
    2. We don't pre-populate remote models; it isn't necessary. Users are better off setting them from Extension Settings. It's more of an extension configuration than a model population.
  8. Migration complexity and UX
    1. We don't convert model.json to model.yaml. Instead, we import with symlinks. That's faster and avoids adding redundant new logic to Jan: a lightweight migration with less risk. Maintaining the Model ID is key; otherwise, all threads break.
    2. We don't move any files (e.g., GGUFs), which would drag out the migration.
    3. How about newly or manually added GGUFs? The model symlink feature is always there for that.
    4. There are bad migration experiences from the past that we can avoid, such as:
      1. Migrating all pre-populated models
      2. Heavy file movement that drags out the duration
      3. Migrating everything at once
    5. Now we just migrate downloaded models:
      1. Import downloaded models only as symlinks (no file movement)
      2. Don't update the ID, which would kill us with data inconsistency
      3. Another thought: do we really need to wait for model.yaml creation during migration?
        1. cortex.cpp can work from models.list alone to provide the available models.
        2. model.yaml generation is an asynchronous operation, so:
          1. It generates model.yaml as soon as the user tries to get or load a model.
          2. It generates model.yaml as soon as the user tries to import.
          3. Don't block the client GUI; the model list can be built from just the models.list contents. Any further operation on a certain model can generate its model.yaml later.
          4. The client will prioritize the active thread's model over others, so users' working threads are not blocked.
          5. If something goes wrong, the GGUF file will still be there, and the model.yaml can be generated during a later operation. model.yaml is not strictly required; it's just a cache of the model file's metadata.
  9. Better cache mechanism
    1. The model list and details used to be read from the File System; now they're fetched via API requests to cortex.cpp.
    2. To prevent slow loading, the client should cache accordingly on the frontend.

Summary

In short, the entire migration process is just creating symlinks to the downloaded models from models.list. No model.yaml or folder manipulation is involved. It should be done almost instantly.

Migration indicator: models.list exists.

Don't pre-populate models. Remote extensions work with their own settings instead of pre-populated models. The Cortex extension registers available-to-pull models (templates) in memory.

cortex.cpp is a daemon process that should be started alongside the app.
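To make that flow concrete, here is a minimal TypeScript sketch of the one-shot migration. The endpoint path, port, payload shape, and helper types are all illustrative assumptions, not cortex.cpp's confirmed API:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shape of an entry from the app's cached download list.
type DownloadedModel = { id: string; ggufPath: string };

// One-shot migration: runs only when models.list is missing.
async function migrateIfNeeded(
  dataFolder: string,
  downloaded: DownloadedModel[], // from the app's cache, not an FS scan
): Promise<void> {
  if (existsSync(join(dataFolder, "models.list"))) return; // already migrated

  // Import downloaded models as symlinks only: no file movement, and the
  // Model ID is preserved so existing threads keep working.
  for (const model of downloaded) {
    await fetch("http://127.0.0.1:3928/models/import", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        id: model.id,         // keep the ID or all threads break
        path: model.ggufPath, // symlinked in place, never copied
        symlink: true,
      }),
    });
  }
  // model.yaml generation is left to cortex.cpp, asynchronously, on first use.
}
```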

Jan migration from 0.5.x to 0.6.0
Image
louis-jan commented 2 months ago

Bundled Engines

Is it possible that cortex.cpp bundles multiple engines but exposes only one gateway?

E.g., the client requests loading a llama.cpp model, and cortex.cpp predicts the compatible hardware and runs the most efficient binary.

So:

Eventually, that's all the client needs to work with – the Model ID (aka model name).
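To illustrate, a request under that gateway needs nothing beyond an OpenAI-style payload. The URL and port here are assumptions:

```typescript
// Sketch: the client knows only the Model ID; cortex.cpp's single gateway
// picks the engine variant behind the scenes.
async function chat(modelId: string, prompt: string) {
  const res = await fetch("http://127.0.0.1:3928/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: modelId, // the only engine-related detail the client sends
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```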

Simplify model load / chat completions request
Image
louis-jan commented 2 months ago

Incremental Path

  1. We do what's not related to cortex.cpp first - Remote Extensions & Pre-populated Models
    1. Rather than pre-populating, enhance the model configurations.
    2. registerModels now lists models available for download and no longer persists model.json.
  2. Better data caching
    1. Data retrieved from extensions should be cached on the frontend for subsequent loads.
    2. Reduce direct API requests and perform more data synchronization operations.
    3. Implementing a good cache layer would spare users a bad experience during the later migration: the app doesn't need to scan through the models list, it can just dump cached data and import right away. It won't interrupt users' working threads, since asynchronous operations take care of data persistence (model.yaml) and model load requests are typically long-delayed responses anyway. (A cache sketch follows after the diagram below.)
  3. Minimal Migration Steps (cortex-cpp ready)
    1. Generate models.list based on cached data; there's no need to scan the Model Folder, which can be costly.
    2. Send model import or symlink requests to generate models.list. It would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a models.list file. The model.yaml files can be generated asynchronously. (This would also cover the case where a user edits models.list manually.)
    3. Update extensions to redirect requests.
    4. The worst-case scenario is users updating from significantly older versions that lack the cache improvements; then we go through the model folders and send import requests during the app update.
```mermaid
sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"

    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList-->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml-->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models
```
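As an illustration of the caching idea in step 2, a stale-while-revalidate sketch; the class name, helper, and endpoint are illustrative, not Jan's actual frontend code:

```typescript
// Sketch: serve cached data instantly, refresh in the background.
class CachedResource<T> {
  private value: T | undefined;

  constructor(private fetcher: () => Promise<T>) {}

  async get(): Promise<T> {
    const current = this.value;
    if (current !== undefined) {
      // Serve the cache immediately; kick off a background refresh.
      this.fetcher()
        .then((fresh) => (this.value = fresh))
        .catch(() => {}); // a failed refresh keeps the last good value
      return current;
    }
    const fresh = await this.fetcher();
    this.value = fresh;
    return fresh;
  }
}

// Usage (endpoint is illustrative):
const models = new CachedResource(() =>
  fetch("http://127.0.0.1:3928/models").then((r) => r.json()),
);
```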
0xSage commented 1 month ago

This is really well thought through.

Questions @louis-jan :

  1. What are the specific attributes Jan needs from the get hardware info endpoint?
    • OS info
    • CPU info
    • RAM size (total, utilized)
    • GPU SKU
    • VRAM size (total, utilized)
    • What else? (do you need storage info, or additional unforeseen stats?)
  2. Do you need hardware configuration endpoints? I.e., does Jan ever need to let users change some hardware-level configuration?
  3. Dumb question, but do you need engine status endpoints? Or is the level of abstraction good enough at the model level?
  4. What are the Cortex sub-process endpoints needed? /keepalive /healthcheck?
louis-jan commented 1 month ago

@0xSage All great questions, related issue that we discussed Hardware Info endpoint. https://github.com/janhq/cortex.cpp/issues/1165#issuecomment-2355343899

  1. Jan does not let users change hardware-level configuration, only engine settings (CPU/GPU mode, CPU threads, ...).
  2. Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case from Jan's side, such as implementing a switch or update mechanism for the engine.
  3. /healthcheck is needed, and it's already implemented in cortex.cpp.

The one thing blocking this is download progress sync, for which we aligned on a socket approach.

0xSage commented 1 month ago

Nice, always great when the best answer is the simplest!! 🙏


louis-jan commented 1 month ago

Package cortex.cpp into Jan app:

  1. The app will bundle all available cortex.cpp binaries, just like the cortex.cpp installer.
  2. During an update, it executes `cortex engines install --sources [path_to_binaries]` (or the equivalent API call); cortex will detect the hardware and install accordingly from the bundled sources (sketched after the list below).
  3. App Settings
    • Use `cortex engines get` to see the installed variant -> update the UI accordingly, e.g. GPU On/Off.
    • cortex.cpp will include a flag to select the variant, letting users choose their GPU from the app settings; it will then install the appropriate cortex engine version accordingly. Since all binaries are included, it's simply a matter of switching between variants. (https://github.com/janhq/cortex.cpp/issues/1390)
    • Investigating multiple-GPU support... TBD
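A sketch of how the update step might drive the bundled CLI. The commands come from the comment above; the exact argument order and paths are assumptions:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Sketch: after an app update, ask cortex to pick and install the right
// engine variant from the binaries bundled with the app.
async function installBundledEngine(cortexBin: string, bundleDir: string) {
  // Equivalent of: cortex engines install --sources [path_to_binaries]
  await run(cortexBin, ["engines", "install", "--sources", bundleDir]);
  // Equivalent of: cortex engines get -- to drive UI state like GPU On/Off.
  const { stdout } = await run(cortexBin, ["engines", "get"]);
  return stdout; // the installed variant, as reported by the CLI
}
```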

Pros

Cons

cortex.cpp configurations on spawn

louis-jan commented 1 month ago

Seamless Model Import & Run

Since we no longer use models.list, imports are now simply based on the engine name.

  1. App 0.6.x opens (given that users have updated to 0.5.5)
  2. Trigger get models on load as usual (extension, asynchronous, in the background).
  3. Go through the downloaded models (from the cache) — very fast, since only previously downloaded models are involved.
  4. Send model import requests for the Nitro models.
  5. Persist the cache.

This operation runs asynchronously and won't affect the UX, since it works with cached data. Even if a model is not yet imported, the app should still function normally (via the stateless model load endpoint). There is still a window for broken requests or attempts. (See the sketch below.)

If users are updating from a very old version, we run a scan over model.json files and persist the cache (per the current legacy logic) -> continue with step (1).
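A minimal sketch of that background loop. The cache shape and import endpoint are hypothetical:

```typescript
// Hypothetical shape of a cached model entry.
type CachedModel = { id: string; engine: string; path: string; imported?: boolean };

// Sketch of the background import loop (steps 2-5 above).
async function importNitroModels(cache: CachedModel[]): Promise<CachedModel[]> {
  for (const model of cache) {
    // Only previously downloaded Nitro models need importing.
    if (model.engine !== "nitro" || model.imported) continue;
    await fetch("http://127.0.0.1:3928/models/import", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ id: model.id, path: model.path }),
    });
    model.imported = true;
  }
  return cache; // persist this updated cache afterwards (step 5)
}
```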

dan-homebrew commented 1 month ago

Current issues being faced:

dan-homebrew commented 1 month ago

@dan-homebrew will create an Implementation Issue. @louis-jan, can you link the implementation-related issues here: