Add API endpoint to load/unload model

sammcj commented 1 year ago

Description

It would be awesome if there were an API (or OpenAI API extension) endpoint that you could use to:

load a model
unload a model
list available models

This would allow hot-loading a model for a specific task, then unloading it again to reduce idle resource consumption.

Additional Context

LocalAI has this functionality, which is really useful. It works like this:

curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"ggml-gpt4all-j","object":"model"}]}

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

etc...
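
For anyone scripting this rather than using curl, the two LocalAI calls above could be sketched in Python roughly as follows. This is only a sketch using the standard library; the helper names (`list_models_request`, `chat_request`, `send`) are made up for illustration, and the URL, model id and payload are copied from the curl examples.

```python
import json
import urllib.request

# Address taken from the curl examples above; adjust for your own setup.
BASE_URL = "http://localhost:8080"


def list_models_request(base_url=BASE_URL):
    """Build the GET request that lists available models."""
    return urllib.request.Request(f"{base_url}/v1/models", method="GET")


def chat_request(model, content, temperature=0.9, base_url=BASE_URL):
    """Build the POST request for a chat completion with the named model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running LocalAI instance:
#   print(send(list_models_request()))
#   print(send(chat_request("ggml-gpt4all-j", "How are you?")))
```

Building the request separately from sending it makes it easy to inspect what would go over the wire before pointing it at a live server.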

I did see that the OpenAI-compatible API extension has some functionality for this, but it's been marked as legacy:

/v1/engines/{model_name} | openai engines.get -i {model_name} | You can use this legacy endpoint to load models via the api or command line

MasX commented 1 year ago

There's the api/v1/model endpoint. There is an example here showing how to load models and list them.
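
A request to that endpoint might look roughly like the sketch below. Treat the details as assumptions rather than a reference: the port (5000), the `action` values (`list`, `load`, `unload`) and the `model_name` field are from memory of the API extension and are not confirmed in this thread, and the helper name `model_action_request` is hypothetical. Check the linked example for the real request shape.

```python
import json
import urllib.request

# ASSUMPTION: host and port of the web UI's API; not stated in this thread.
API_URL = "http://localhost:5000/api/v1/model"


def model_action_request(action, model_name=None, url=API_URL):
    """Build a POST request for the api/v1/model endpoint.

    ASSUMPTION: the endpoint accepts a JSON body with an "action" key
    (e.g. "list", "load", "unload") and, for loading, a "model_name" key.
    """
    payload = {"action": action}
    if model_name is not None:
        payload["model_name"] = model_name
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and decode the JSON reply (loading a model can be slow)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server (model name is a placeholder):
#   print(send(model_action_request("list")))
#   print(send(model_action_request("load", "some-model-name")))
#   print(send(model_action_request("unload")))
```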

sammcj commented 1 year ago

Oh my gosh, how did I miss that! I even looked through those examples again today 🤣 🤦

oobabooga / text-generation-webui

Add API endpoint to load/unload model #3794