I would like to float a thought balloon: Nitro could expand its feature set to automagically handle model loading/unloading ops, abstracting them away from the developer
The rise of multi-modal AI makes loading/unloading and VRAM/RAM management important
This is an opportunity for us to grow Nitro's value-add layer, vs. just being a "server.cpp"
Reasoning
Developers new to local AI are used to OpenAI, where they don't have to think about model loading/unloading
Devs will just call chat/completions with different model names
Devs will just call chat/completions with images, sound, files etc
As we progress towards multi-modal AI on resource-constrained devices, "model ops" is a concern
Hearing, speaking, and thinking all require different models (even if we eventually move to LLaVA-style models)
Memory management, especially on constrained devices, is a field ripe for optimization
Jan will primarily be a VSCode-tier product (i.e. application layer)
Nitro & Jan should have clean abstractions driven by architectural decisions
I see Jan's roadmap as being very driven by Assistants, while leaving local model management to Nitro
Imho, this functionality should live with Nitro
Proposal
Nitro be "model aware"
A Nitro server would be "aware" of what models it has access to
Naturally, this would be via a /models folder (aligning with Jan has significant business benefits for us)
This would allow users to pass in model params in chat/completions, similar to OpenAI
Nitro to abstract model loading/unloading from user
Devs would just send chat/completions requests to Nitro
Nitro will handle loading/unloading (we just assume requests can always be queued in-memory)
Nitro -may- send "statuses" via SSE to requester, to indicate model loading time lag (Dan's question: is this a good idea?)
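To make the queuing and SSE-status ideas above concrete, here is a minimal sketch (not Nitro's actual internals; all names are hypothetical) of a worker that holds chat/completions requests in-memory while the model loads, streaming SSE-style "status" events so the requester can see the loading lag:

```python
import threading
import time

# Hypothetical sketch: requests sent while the model is still loading are
# answered only once loading finishes; in the meantime, SSE-style "status"
# events tell the requester about the load lag.
class ModelWorker:
    def __init__(self, model_name, load_seconds):
        self.model_name = model_name
        self.loaded = threading.Event()
        # Stand-in for kicking off an asynchronous llama.cpp model load.
        threading.Thread(target=self._load, args=(load_seconds,), daemon=True).start()

    def _load(self, load_seconds):
        time.sleep(load_seconds)  # simulate load time
        self.loaded.set()

    def chat_completions(self, prompt):
        # Poll until the model is loaded, yielding status events in the meantime.
        while not self.loaded.wait(timeout=0.05):
            yield {"event": "status", "data": f"loading {self.model_name}"}
        yield {"event": "completion", "data": f"(reply to: {prompt})"}

worker = ModelWorker("mistral-7b", load_seconds=0.3)
events = list(worker.chat_completions("hello"))
# Status events first, then exactly one completion event at the end.
```

Whether clients actually want these interim events is exactly Dan's open question; the same worker could simply block silently until the completion is ready.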
Nitro to have a /models endpoint
We would refactor current Nitro models endpoints:
| Status Quo | Recommended Change | Function |
| --- | --- | --- |
| /modelstatus | GET v1/models | Return state of models in Nitro |
| /loadmodel | POST v1/models/load { ...params } | Loads model |
| /unloadmodel | POST v1/models/unload { ...params } | Unloads model |
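The behavior behind the three refactored endpoints can be sketched with an in-memory registry (endpoint names from the table; the internals are hypothetical placeholders, not real llama.cpp handles):

```python
# Hypothetical sketch of the refactored model endpoints' behavior.
class ModelRegistry:
    def __init__(self):
        self.models = {}  # name -> {"status": ..., "params": ...}

    def load(self, name, **params):   # POST v1/models/load
        self.models[name] = {"status": "loaded", "params": params}
        return {"model": name, "status": "loaded"}

    def unload(self, name):           # POST v1/models/unload
        self.models.pop(name, None)
        return {"model": name, "status": "unloaded"}

    def list(self):                   # GET v1/models
        return {"data": [{"id": n, **m} for n, m in self.models.items()]}

registry = ModelRegistry()
registry.load("mistral-7b", ctx_len=2048)
# registry.list() now reports mistral-7b as loaded, with its load params.
```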
Nitro to be aware of available RAM/VRAM
Debatable: Nitro should be aware of available RAM/VRAM
Nitro should figure out when to unload models (there are many good techniques to borrow from OS memory management, e.g. LRU eviction)
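As one illustration of borrowing from memory-management principles, here is a sketch of least-recently-used eviction under a fixed RAM budget (the model names, sizes, and budget are all made up for the example):

```python
from collections import OrderedDict

# Hypothetical sketch: evict least-recently-used models when a new load
# would exceed the available RAM budget. Sizes in GB are made up.
class RamAwareLoader:
    def __init__(self, budget_gb):
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()  # name -> size_gb, in LRU order

    def ensure_loaded(self, name, size_gb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return []
        evicted = []
        # Evict LRU models until the new model fits (or nothing is left).
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget_gb:
            victim, _ = self.loaded.popitem(last=False)
            evicted.append(victim)
        self.loaded[name] = size_gb
        return evicted

loader = RamAwareLoader(budget_gb=28)
loader.ensure_loaded("mistral-7b.q4", 4)      # fits: 4 GB used
loader.ensure_loaded("neuralchat-7b.q8", 7)   # fits: 11 GB used
evicted = loader.ensure_loaded("codellama-34b.q4", 19)  # evicts the LRU model
```

Real policy would also need to account for VRAM vs. RAM placement and in-flight requests, but the budget-plus-eviction shape is the core idea.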
Open Questions
Q: What happens if a user wants to load a model with specific params?
A: Nitro still retains a POST v1/models/load endpoint for this purpose, and subsequent calls to that model will use those params (until it is unloaded)
A: In my opinion, the most dev-friendly approach is to let developers define a model.json with load configs; I don't see most developers needing to change it frequently
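For illustration, such a model.json might look like the following (the field names here are loosely modeled on common llama.cpp-style load settings and are purely illustrative, not a confirmed Nitro schema):

```json
{
  "ctx_len": 2048,
  "ngl": 100,
  "embedding": false,
  "cont_batching": false
}
```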
Example
[User] sends a chat/completions request for Mistral
[Nitro] loads Mistral, and sends chat/completions back
[User]
# Request 2 to Neural Chat (a bin file)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neuralchat-7b-ggufv3.q8.bin",
    ...
  }'
[Nitro] sees that it has enough RAM to hold both Mistral and Neural Chat, and proceeds to load Neural Chat
[User]
# Request 3 to Codellama, by passing in a path to a model outside the /models folder
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/codellama-34b-gguf.q4_k_m.bin",
    ...
  }'
[Nitro] sees that Codellama is fairly large, and proceeds to unload Mistral but not Neural Chat