I would like to float a thought balloon: Nitro could expand its feature set to automagically handle model loading/unloading ops, abstracting them away from the developer
The rise of multi-modal AI makes loading/unloading and VRAM/RAM management important
This is an opportunity for us to grow Nitro's value-add layer, vs. just being a "server.cpp"
Reasoning
Developers new to local AI are used to OpenAI, where they don't have to think about model loading/unloading
Devs will just call chat/completions with different model names
Devs will just call chat/completions with images, sound, files etc
As we progress towards multi-modal AI on resource-constrained devices, "model ops" is a concern
Hearing, speaking, and thinking all require different models (even if we eventually move to LLaVA-style models)
Memory management, especially on constrained devices, is a field ripe for optimization
Jan will primarily be a VSCode-tier product (i.e. application layer)
Nitro & Jan should have clean abstractions driven by architectural decisions
I see Jan's roadmap as being very driven by Assistants, while leaving local model management to Nitro
Imho, this functionality should live with Nitro
Proposal
Nitro be "model aware"
A Nitro server would be "aware" of what models it has access to
Naturally, this would be via a /models folder (aligning with Jan has significant business benefits for us)
This would allow users to pass in model params in chat/completions, similar to OpenAI
Nitro to abstract model loading/unloading from user
Devs would just send chat/completions requests to Nitro
Nitro will handle loading/unloading (we just assume requests can always be queued in-memory)
Nitro -may- send "statuses" via SSE to requester, to indicate model loading time lag (Dan's question: is this a good idea?)
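To make the queuing and SSE-status ideas above concrete, here is a minimal sketch (not Nitro's actual internals; all names are hypothetical) of a worker that holds chat/completions requests in-memory while the model loads, streaming SSE-style "status" events so the requester can see the loading lag:

```python
import threading
import time

# Hypothetical sketch: requests sent while the model is still loading are
# answered only once loading finishes; in the meantime, SSE-style "status"
# events tell the requester about the load lag.
class ModelWorker:
    def __init__(self, model_name, load_seconds):
        self.model_name = model_name
        self.loaded = threading.Event()
        # Stand-in for kicking off an asynchronous llama.cpp model load.
        threading.Thread(target=self._load, args=(load_seconds,), daemon=True).start()

    def _load(self, load_seconds):
        time.sleep(load_seconds)  # simulate load time
        self.loaded.set()

    def chat_completions(self, prompt):
        # Poll until the model is loaded, yielding status events in the meantime.
        while not self.loaded.wait(timeout=0.05):
            yield {"event": "status", "data": f"loading {self.model_name}"}
        yield {"event": "completion", "data": f"(reply to: {prompt})"}

worker = ModelWorker("mistral-7b", load_seconds=0.3)
events = list(worker.chat_completions("hello"))
# Status events first, then exactly one completion event at the end.
```

Whether clients actually want these interim events is exactly Dan's open question; the same worker could simply block silently until the completion is ready.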
Nitro to have a /models endpoint
We would refactor current Nitro models endpoints:
| Status Quo | Recommended Change | Function |
| --- | --- | --- |
| /modelstatus | GET v1/models | Return state of models in Nitro |
| /loadmodel | POST v1/models/load { ...params } | Loads model |
| /unloadmodel | POST v1/models/unload { ...params } | Unloads model |
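The behavior behind the three refactored endpoints can be sketched with an in-memory registry (endpoint names from the table; the internals are hypothetical placeholders, not real llama.cpp handles):

```python
# Hypothetical sketch of the refactored model endpoints' behavior.
class ModelRegistry:
    def __init__(self):
        self.models = {}  # name -> {"status": ..., "params": ...}

    def load(self, name, **params):   # POST v1/models/load
        self.models[name] = {"status": "loaded", "params": params}
        return {"model": name, "status": "loaded"}

    def unload(self, name):           # POST v1/models/unload
        self.models.pop(name, None)
        return {"model": name, "status": "unloaded"}

    def list(self):                   # GET v1/models
        return {"data": [{"id": n, **m} for n, m in self.models.items()]}

registry = ModelRegistry()
registry.load("mistral-7b", ctx_len=2048)
# registry.list() now reports mistral-7b as loaded, with its load params.
```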
Nitro to be aware of available RAM/VRAM
Debatable: Nitro should be aware of available RAM/VRAM
Nitro should figure out when to unload models (there are many good techniques to borrow from OS memory management, e.g. LRU eviction)
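As one illustration of borrowing from memory-management principles, here is a sketch of least-recently-used eviction under a fixed RAM budget (the model names, sizes, and budget are all made up for the example):

```python
from collections import OrderedDict

# Hypothetical sketch: evict least-recently-used models when a new load
# would exceed the available RAM budget. Sizes in GB are made up.
class RamAwareLoader:
    def __init__(self, budget_gb):
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()  # name -> size_gb, in LRU order

    def ensure_loaded(self, name, size_gb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return []
        evicted = []
        # Evict LRU models until the new model fits (or nothing is left).
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget_gb:
            victim, _ = self.loaded.popitem(last=False)
            evicted.append(victim)
        self.loaded[name] = size_gb
        return evicted

loader = RamAwareLoader(budget_gb=28)
loader.ensure_loaded("mistral-7b.q4", 4)      # fits: 4 GB used
loader.ensure_loaded("neuralchat-7b.q8", 7)   # fits: 11 GB used
evicted = loader.ensure_loaded("codellama-34b.q4", 19)  # evicts the LRU model
```

Real policy would also need to account for VRAM vs. RAM placement and in-flight requests, but the budget-plus-eviction shape is the core idea.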
Open Questions
Q: What happens if a user wants to load a model with specific params?
A: Nitro still retains a POST v1/models/load endpoint for this purpose, and subsequent calls to that model will use those params (until it is unloaded)
A: In my opinion, the most dev-friendly approach is to let developers define a model.json with load configs; I don't see most developers needing to change it frequently
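For illustration, such a model.json might look like the following (the field names here are loosely modeled on common llama.cpp-style load settings and are purely illustrative, not a confirmed Nitro schema):

```json
{
  "ctx_len": 2048,
  "ngl": 100,
  "embedding": false,
  "cont_batching": false
}
```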
Example
[User] sends a chat/completions request for Mistral
[Nitro] loads Mistral, and sends chat/completions back
[User]
# Request 2 to Neural Chat (a bin file)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neuralchat-7b-ggufv3.q8.bin",
    ...
  }'
[Nitro] sees that it has enough RAM to hold both Mistral and Neural Chat, and proceeds to load Neural Chat
[User]
# Request 3 to Codellama, by passing in a path to a model outside the /models folder
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/codellama-34b-gguf.q4_k_m.bin",
    ...
  }'
[Nitro] sees that Codellama is fairly large, and proceeds to unload Mistral but not Neural Chat