Closed sammcj closed 1 year ago
Description
It would be awesome if there was an API (or OpenAI API extension) endpoint that you could use to:
This would allow hot loading of a model for a specific task, then unloading it again to reduce idle resource consumption etc...
Additional Context
LocalAI has this functionality, which is really useful. It works like this:
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"ggml-gpt4all-j","object":"model"}]}
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-gpt4all-j",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.9
}'
etc...
I did see that the OpenAI-compatible API extension has some functionality for this, but it's been marked as legacy:
/v1/engines/{model_name} | openai engines.get -i {model_name} | You can use this legacy endpoint to load models via the api or command line
There's the api/v1/model endpoint. There is an example showing how to use it to load models and to list them.
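For anyone landing here later, a minimal sketch of what calling that endpoint from Python might look like. The base URL (localhost:5000), the payload keys "action" and "model_name", and the action names "list" and "load" are assumptions for illustration and are not confirmed in this thread; check them against the example shipped with your version before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL; the real host and port depend on how the server was started.
BASE_URL = "http://localhost:5000"


def build_model_request(action, model_name=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for the api/v1/model endpoint.

    The payload keys "action" and "model_name" are assumptions, not a
    documented contract.
    """
    payload = {"action": action}
    if model_name is not None:
        payload["model_name"] = model_name
    return urllib.request.Request(
        f"{base_url}/api/v1/model",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_model_api(action, model_name=None, base_url=BASE_URL):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_model_request(action, model_name, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, call_model_api("list") would ask for the available models and call_model_api("load", "<model name>") would ask it to load one, assuming the payload shape above matches your version.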
Oh my gosh, how did I miss that! I even looked through those examples again today 🤣 🤦