ollama / ollama-python

Ollama Python library
https://ollama.com
MIT License

Understanding Variable Response Times from a Local Mistral Model #78

Closed fentresspaul61B closed 4 months ago

fentresspaul61B commented 4 months ago

Hello,

Thank you for the great package. I have been exploring it and am considering using it to host an LLM on my company's Linux server. Our overarching goal is to move away from the OpenAI API, due to its long and varying response times, and instead host an open-source LLM on our own machine. It seems that Ollama may work for this use case.

However, after running some initial prompts with the Mistral model, I noticed that the response times were not consistent. I am currently running the code on a CPU on our Linux server (so I did not expect fast response times); however, I did expect consistent response times. From what I understand, the Ollama Python package uses models that are downloaded to your machine, and is therefore running offline on that machine, not in the cloud on a third-party server (correct me if I am wrong here).

I was surprised to see my first responses take around 60 seconds, and then that response time drop to 10 and 7 seconds when using the exact same prompt. This behavior reminds me of cold starts when using serverless infra, but again, I am assuming this is a local, offline model.

I also waited a few minutes and re-ran the script, and again got an initial response time of about 1 minute followed by roughly 5-second response times, so it seems like there is some type of initialization or cold start?

Here is the code (with the prompt hidden), taken from your README:


import time

import ollama

# Time a single chat request against the local Mistral model.
s = time.time()
response = ollama.chat(model='mistral', messages=[
  {
    'role': 'user',
    'content': TEST_PROMPT,  # prompt hidden
  },
])
print(response['message']['content'])
e = time.time()
print('Response time:', e - s)

Responses for the exact same input:

(AI) paul@deva-companion-python:/python_backend/AI$ python3 ollama_test.py
 Hi Pippy, this bedroom has so many interesting things. What catches your eye?
Response time: 65.01464176177979
(AI) paul@deva-companion-python:/python_backend/AI$ python3 ollama_test.py
 "Hi Pippy, this bedroom has so many interesting things. What catches your eye?"
Response time: 5.180686712265015
(AI) paul@deva-companion-python:/python_backend/AI$ python3 ollama_test.py
 "Hi Pippy, this bedroom has so many interesting things. What catches your eye?"
Response time: 5.081787586212158
(AI) paul@deva-companion-python:/python_backend/AI$ python3 ollama_test.py
 Hello Pippy! I'm glad you're here. This bedroom has so many interesting things. What catches your eye?
Response time: 6.901242017745972

Some Questions

Thanks again,

Paul

mxyng commented 4 months ago

The high timing on the first run is because Ollama needs to load the model into (GPU) memory. Subsequent requests take roughly the same amount of time. There will still be some variance since the outputs will differ. You can set options={'temperature': 0} for reproducible results and more consistent timings.
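
For example, something along these lines (a minimal sketch reusing the mistral model and the TEST_PROMPT placeholder from the snippet above):

import ollama

# temperature=0 makes decoding deterministic, so repeated runs of the
# same prompt should return the same text and more comparable timings.
response = ollama.chat(
    model='mistral',
    messages=[{'role': 'user', 'content': TEST_PROMPT}],
    options={'temperature': 0},
)
print(response['message']['content'])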

By default, the model is unloaded after 5 minutes. Once the model is unloaded, the next request needs to load it back into memory. You can change this behaviour by setting keep_alive, e.g. ollama.chat(model='...', messages=[...], keep_alive=...). keep_alive can be either a duration (60s, 5m, 1h) or a number (in seconds). A negative keep_alive, e.g. -1, will keep the model in memory indefinitely.
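
For example (a sketch based on the call shown above; the 1h value is just for illustration):

import ollama

# keep_alive='1h' keeps the model loaded for an hour after each request;
# keep_alive=-1 would keep it loaded indefinitely.
response = ollama.chat(
    model='mistral',
    messages=[{'role': 'user', 'content': TEST_PROMPT}],
    keep_alive='1h',
)
print(response['message']['content'])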

fentresspaul61B commented 3 months ago

Thank you for the response. That will hopefully resolve the loading/cold-start issue. I am still trying to understand how these different parts work together.

I was wondering if you could help me with this question that I originally proposed:

"From what I understand, the Ollama python packages uses the models that are downloaded to your machine, and therefore are running offline, on that machine, not in the cloud by a third party server (correct me if I am wrong here)."

mxyng commented 3 months ago

From what I understand, the Ollama Python package uses models that are downloaded to your machine, and is therefore running offline on that machine, not in the cloud on a third-party server (correct me if I am wrong here).

This is mostly correct. This library implements a Python interface to the Ollama API. As such, it requires a running Ollama instance, which can be local (same machine) or remote (hosted by you or someone you trust). The default configuration assumes a local instance (127.0.0.1:11434 to be precise). This Ollama instance is responsible for downloading and running models.
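
For example, a minimal sketch of pointing the client at an explicit host (the URL shown is just the default local address; swap in your own server's address for a remote instance):

from ollama import Client

# By default the library talks to the local instance at http://127.0.0.1:11434;
# point this at a remote Ollama server if you host one elsewhere.
client = Client(host='http://127.0.0.1:11434')

response = client.chat(model='mistral', messages=[
    {'role': 'user', 'content': 'Hello!'},
])
print(response['message']['content'])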