janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so
Apache License 2.0

Discussion: Nitro to automagically handle Model Operations? #175

Closed dan-homebrew closed 9 months ago

dan-homebrew commented 11 months ago

Objective

Reasoning

Proposal

Nitro to be "model aware"

Nitro to abstract model loading/unloading from the user

Nitro to have a /models endpoint

We would refactor the current Nitro model endpoints (a sketch of the proposed calls follows the table):

| Status Quo | Recommended Change | Function |
| --- | --- | --- |
| `/modelstatus` | `GET v1/models` | Returns the state of models in Nitro |
| `/loadmodel` | `POST v1/models/load { ...params }` | Loads a model |
| `/unloadmodel` | `POST v1/models/unload { ...params }` | Unloads a model |
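As a sketch, querying model state via the proposed `GET v1/models` could look like this (the response shape is an assumption, not a settled design):

```sh
# List models and their load state (proposed endpoint)
curl http://localhost:3928/v1/models

# Hypothetical response -- field names are illustrative only:
# {
#   "data": [
#     { "id": "mistral-7b", "state": "loaded" },
#     { "id": "neuralchat-7b-ggufv3.q8.bin", "state": "unloaded" }
#   ]
# }
```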

Nitro to be aware of available RAM/VRAM
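On Linux, one plausible way to probe available RAM and VRAM is via `free` and `nvidia-smi` (a sketch; how Nitro would actually measure memory is an open implementation detail):

```sh
# Available system RAM in bytes (the "available" column of free)
free -b | awk '/^Mem:/ {print $7}'

# Free VRAM per GPU in MiB (NVIDIA GPUs only)
nvidia-smi --query-gpu=memory.free --format=csv,noheader
```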

Open Questions

Q: What happens if a user wants to load a model with specific params?

A: Nitro still retains a `POST v1/models/load` endpoint for this purpose, and subsequent calls to that model will use those params (unless it is unloaded).

A: In my opinion, the most dev-friendly approach is to let them define a `model.json` with load configs; I don't see most developers needing to change it frequently.
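For the first answer, an explicit load might look like the following sketch. The `ctx_len` and `ngl` params are borrowed from Nitro's existing load options and are illustrative here, not a confirmed schema:

```sh
# Load a model with explicit params (proposed POST v1/models/load)
curl http://localhost:3928/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "ctx_len": 2048,
    "ngl": 32
  }'
```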

Example

Developer populates the models folder:

```
# Folder structure
/models
  /mistral-7b                    # { model-id }
    mistral-7b-gguf.q7.bin
    model.json                   # default params
  neuralchat-7b-ggufv3.q8.bin
```
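A `model.json` carrying default load params might look like this (a hypothetical sketch; the field names are assumptions, not a defined schema):

```json
{
  "ctx_len": 2048,
  "ngl": 32
}
```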

Developer starts Nitro server:

```sh
# Nitro server
nitro --models ~/models
```

Developer proceeds to use Nitro:

[User]
```sh
# Request 1 to Mistral
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    ...
  }'
```
[Nitro] loads Mistral and sends the chat/completions response back.

[User]
```sh
# Request 2 to Neural Chat (a bare .bin file)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neuralchat-7b-ggufv3.q8.bin",
    ...
  }'
```
[Nitro] sees that it has enough RAM to hold both Mistral and Neural Chat, and proceeds to load Neural Chat.

[User]
```sh
# Request 3 to CodeLlama, passing in a path to a model not in the folder
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/codellama-34b-gguf.q4_k_m.bin",
    ...
  }'
```
[Nitro] sees that CodeLlama is fairly large, and proceeds to unload Mistral but not Neural Chat.
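At this point, the proposed `GET v1/models` call should reflect the eviction decision; a hypothetical check (the response format is an assumption):

```sh
# Check which models are resident after the three requests
curl http://localhost:3928/v1/models

# Hypothetically:
#   mistral-7b                              -> unloaded
#   neuralchat-7b-ggufv3.q8.bin             -> loaded
#   /path/to/codellama-34b-gguf.q4_k_m.bin  -> loaded
```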
dan-homebrew commented 11 months ago

Also cc @louis-jan @vuonghoainam for separation of concerns between Jan and Nitro.

hiro-v commented 9 months ago

Done with the discussion; will reference this in subsequent issues.