karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0

Add `keep_alive` parameter to Ollama requests #477

Closed jasalt closed 1 day ago

jasalt commented 1 day ago

The time an Ollama model stays loaded in memory after a request (keep_alive, in seconds) can be set as a parameter to the Ollama Chat API: https://github.com/ollama/ollama/pull/2146#issuecomment-2282889389. Maybe it would be a useful addition to gptel's settings, if it makes sense. It defaults to 5 minutes, which gets annoying with bigger models.

Alternatively, it can be set with the environment variable OLLAMA_KEEP_ALIVE=<SECONDS> before running ollama serve.
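For example, a server-wide default could look like this (a sketch; as far as I can tell the variable also accepts Go-style duration strings such as "30m", not only seconds):

OLLAMA_KEEP_ALIVE=30m ollama serve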

karthink commented 1 day ago

There are hundreds of options (across backends), so exposing them all as gptel configuration variables does not scale. Thankfully what you're looking for can be done easily when defining the Ollama backend:

(gptel-make-ollama "Ollama"
  :host "localhost:11434"
  :models '(openhermes:latest)
  :stream t
  :request-params '(:keep_alive "60m"))

You can add any other request parameters you need to :request-params this way. If you want different keep_alive settings for different models you can specify it per model instead. See #330 and #471 for more details and examples.
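For reference, a per-model variant might be sketched like this (assuming the per-model plist syntax discussed in the linked issues; the model name and duration are just examples):

(gptel-make-ollama "Ollama"
  :host "localhost:11434"
  :stream t
  :models '((openhermes:latest
             :request-params (:keep_alive "60m"))))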

jasalt commented 1 day ago

Right, it did not come to my mind to look for a generic :request-params config option, thanks. Seems like a good solution.