petrm opened 2 weeks ago
Expanding on my investigation: the hoarder INFERENCE_JOB_TIMEOUT_SEC variable sets the keep_alive value in the request, which explicitly says:
keep_alive <string | number>: (Optional) How long to keep the model loaded.
So in fact this does the same as OLLAMA_KEEP_ALIVE, but at the request level.
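For reference, this is roughly what a request-level keep_alive looks like with ollama-js (host, model, and prompt here are placeholders, not hoarder's actual values):

const { Ollama } = require('ollama');

const ollama = new Ollama({ host: 'http://ollama:11434' });

async function main() {
  const res = await ollama.chat({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'hello' }],
    // Per-request equivalent of OLLAMA_KEEP_ALIVE: how long the model stays
    // loaded after this request, not how long the request is allowed to run.
    keep_alive: '20m',
  });
  console.log(res.message.content);
}

main();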
Looking at the ollama-js library, it seems to be using streaming, so I am not sure where the timeout comes from :( Sometimes I also get an error like this, which I have seen in one of the first hoarder bugs around the ollama integration.
ollama | 2024-11-07T20:24:47.289037279Z [GIN] 2024/11/07 - 20:24:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:24:47.310757138Z 2024-11-07T20:24:47.310Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:29:47.463722205Z [GIN] 2024/11/07 - 20:29:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:29:47.499741771Z 2024-11-07T20:29:47.497Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:34:47.667035483Z [GIN] 2024/11/07 - 20:34:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:34:47.685824389Z 2024-11-07T20:34:47.685Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:39:47.820565942Z [GIN] 2024/11/07 - 20:39:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:39:47.820825324Z 2024-11-07T20:39:47.820Z error: [inference][12794] inference job failed: TypeError: fetch failed
hoarder | 2024-11-07T20:39:47.820845081Z TypeError: fetch failed
hoarder | 2024-11-07T20:39:47.820849760Z at node:internal/deps/undici/undici:13185:13
hoarder | 2024-11-07T20:39:47.820853838Z at async post (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:114:20)
hoarder | 2024-11-07T20:39:47.820857765Z at async Ollama.processStreamableRequest (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:232:25)
hoarder | 2024-11-07T20:39:47.820861903Z at async OllamaInferenceClient.runModel (/app/apps/workers/inference.ts:2:3086)
hoarder | 2024-11-07T20:39:47.820876501Z at async OllamaInferenceClient.inferFromText (/app/apps/workers/inference.ts:2:3790)
hoarder | 2024-11-07T20:39:47.820879146Z at async inferTagsFromText (/app/apps/workers/openaiWorker.ts:6:2663)
hoarder | 2024-11-07T20:39:47.820881630Z at async inferTags (/app/apps/workers/openaiWorker.ts:6:2886)
hoarder | 2024-11-07T20:39:47.820884175Z at async Object.runOpenAI [as run] (/app/apps/workers/openaiWorker.ts:6:6316)
hoarder | 2024-11-07T20:39:47.820886810Z at async Runner.runOnce (/app/apps/workers/node_modules/.pnpm/@hoarder+queue@file+packages+queue/node_modules/@hoarder/queue/runner.ts:2:2567)
Ok, that's weird. If your job timeout is 20 mins, you shouldn't see inference jobs starting at 5 min intervals. Also, you should have seen at least some errors between the runs. Let's start with the obvious question: did you run 'docker compose up' after updating the env file?
Yeah, ollama is up and reachable. I can see the behavior from my comment above with some of the bookmarks: hoarder tries to run inference on the same bookmark four times, and I always see a 5m POST in the ollama log. On the last attempt the exception appears.
hoarder | 2024-11-08T17:31:38.931997318Z 2024-11-08T17:31:38.931Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:36:39.111826101Z 2024-11-08T17:36:39.111Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:41:39.364475327Z 2024-11-08T17:41:39.363Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:46:39.578103571Z 2024-11-08T17:46:39.577Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:51:39.773566800Z 2024-11-08T17:51:39.773Z error: [inference][12868] inference job failed: TypeError: fetch failed
hoarder | 2024-11-08T17:51:39.773600023Z TypeError: fetch failed
hoarder | 2024-11-08T17:51:39.773610232Z at node:internal/deps/undici/undici:13185:13
hoarder | 2024-11-08T17:51:39.773618879Z at async post (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:114:20)
hoarder | 2024-11-08T17:51:39.773628357Z at async Ollama.processStreamableRequest (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:232:25)
hoarder | 2024-11-08T17:51:39.773637484Z at async OllamaInferenceClient.runModel (/app/apps/workers/inference.ts:2:3086)
hoarder | 2024-11-08T17:51:39.773645780Z at async OllamaInferenceClient.inferFromText (/app/apps/workers/inference.ts:2:3790)
hoarder | 2024-11-08T17:51:39.773654115Z at async inferTagsFromText (/app/apps/workers/openaiWorker.ts:6:2663)
hoarder | 2024-11-08T17:51:39.773662692Z at async inferTags (/app/apps/workers/openaiWorker.ts:6:2886)
hoarder | 2024-11-08T17:51:39.773671088Z at async Object.runOpenAI [as run] (/app/apps/workers/openaiWorker.ts:6:6316)
hoarder | 2024-11-08T17:51:39.773679774Z at async Runner.runOnce (/app/apps/workers/node_modules/.pnpm/@hoarder+queue@file+packages+queue/node_modules/@hoarder/queue/runner.ts:2:2567)
ollama | 2024-11-08T17:31:38.907763254Z [GIN] 2024/11/08 - 17:31:38 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:36:39.093509907Z [GIN] 2024/11/08 - 17:36:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:41:39.326055551Z [GIN] 2024/11/08 - 17:41:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:46:39.555181745Z [GIN] 2024/11/08 - 17:46:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:51:39.773308500Z [GIN] 2024/11/08 - 17:51:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
I also tried increasing the hoarder timeout to a large value (1h), just to rule out something funny like the 20m timeout being an aggregation of 4×5min.
This is getting weirder every time I look at it. I have been going through the logs and I also see cases like this:
hoarder | 2024-11-09T07:54:05.516343115Z 2024-11-09T07:54:05.516Z info: [inference][12916] Starting an inference job for bookmark with id "p5mlkh78gd9s0xux6vfk2qe4"
ollama | 2024-11-09T07:59:05.698825391Z [GIN] 2024/11/09 - 07:59:05 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-09T07:59:05.722073635Z 2024-11-09T07:59:05.721Z info: [inference][12916] Starting an inference job for bookmark with id "p5mlkh78gd9s0xux6vfk2qe4"
ollama | 2024-11-09T08:00:36.786488401Z [GIN] 2024/11/09 - 08:00:36 | 200 | 1m31s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-09T08:00:36.787668548Z 2024-11-09T08:00:36.787Z info: [inference][12916] Inferring tag for bookmark "p5mlkh78gd9s0xux6vfk2qe4" used 1662 tokens and inferred: database,open source,data management,postgresql,software engineering,web application
hoarder | 2024-11-09T08:00:36.820932932Z 2024-11-09T08:00:36.820Z info: [inference][12916] Completed successfully
This would support my earlier idea that there is some sort of aggregated timeout.
Is there any chance this could have something to do with the retries? https://github.com/hoarder-app/hoarder/blob/8f44c81b52c8ef5999929c225b3f883d00c157ec/packages/shared/queues.ts#L44 I still can't figure out where the 5 minutes are coming from, though.
I am not sure what you mean by "this". What I know is that the logging is currently VERY confusing, since timeouts/retries of jobs do not log anything. The error handler for jobs is only invoked once the job has completely failed (after all the retries), and only then does it log the actual problem of the task. In between you simply see that the task was started again (without any info that the previous execution failed or why it failed).
I am thinking it might be this: https://github.com/ollama/ollama-js/issues/103#issuecomment-2227531780
The comment suggests that it has nothing to do with anything in ollama or hoarder, but with the default timeout for fetch requests on nodejs, which seems to be 5 minutes.
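If that is the culprit, it should be reproducible outside of hoarder with nothing but Node's built-in fetch against an endpoint that takes more than 5 minutes to start answering. A minimal sketch, assuming Node 18+ and a placeholder URL/body (the error shape matches the stack traces above, with the underlying cause reported by undici):

// Built-in fetch is backed by undici, whose default headersTimeout
// is 300000 ms (5 minutes).
async function repro() {
  try {
    await fetch('http://ollama:11434/api/chat', {
      method: 'POST',
      body: JSON.stringify({
        model: 'llama3.1',
        messages: [{ role: 'user', content: '...' }],
        stream: false,
      }),
    });
  } catch (err) {
    // Surfaces as "TypeError: fetch failed"; err.cause carries the
    // underlying timeout error from undici.
    console.error(err, err.cause);
  }
}

repro();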
@kamtschatka
the logging is currently VERY confusing
Yeah, I agree. I'm also as surprised as you are that I ended up implementing it that way :D I can fix that.
fetch requests on nodejs, which seems to be 5 minutes.
That indeed looks very likely to be the case.
I am not sure what you mean by "this". What I know is that the logging is currently VERY confusing, since timeouts/retries of jobs do not log anything. The error handler for jobs is only invoked once the job has completely failed (after all the retries), and only then does it log the actual problem of the task. In between you simply see that the task was started again (without any info that the previous execution failed or why it failed).
Sorry, the context was just in my brain and not in the comment. You described pretty much what I meant: that example shows a successful inference, and yet there is the 5m timeout. I can't explain why the second request is so short and returns correctly.
Your deduction that the nodejs fetch timeout is 5 minutes is what I have been looking for. One thing does not add up for me though. What kind of timeout is it? If it were a connection timeout (like curl's), it would only trigger if the server did not respond within 5 minutes. It looks more like a maximum request duration, but I would expect that not to apply to a streaming response.
but I would expect that not to apply to a streaming response.
@petrm Maybe the time to first token is higher than 5 minutes?
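For what it's worth, undici has two separate timeouts in play here, and as far as I can tell both default to 5 minutes, so either slow response headers or a long stall between streamed chunks could trip it. The sketch below just spells out those defaults on an explicit Agent (the values are the documented defaults, not a recommendation):

const { Agent } = require('undici');

const agent = new Agent({
  // Time allowed until the response *headers* have been received.
  headersTimeout: 300000, // default: 5 minutes
  // Idle time allowed between body chunks once the response is streaming.
  bodyTimeout: 300000, // default: 5 minutes
});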
I am thinking it might be this: ollama/ollama-js#103 (comment)
The comment suggests that it has nothing to do with anything in ollama or hoarder, but with the default timeout for fetch requests on nodejs, which seems to be 5 minutes.
I think that issue is mixing different timeouts, and other issues about a 5m timeout are being referenced as well. In one case it is a long model loading time - that is not my case, loading is super fast for me. The other one is the nodejs request duration.
My ollama actually responds with 200 at the 5m mark and not with a 500 like in the referenced issue.
I have the same problem and saw a 500 at the 5-minute mark, but sometimes also a 200; the 200s were far more frequent than the 500s.
Ok, multiple people are hitting this now, so it's clearly a bug. I'm convinced it's probably the default fetch timeout thing.
It is the fetch timeout. I "patched" it by doing the following:
Copy the ollama node module off the container.
sudo docker cp hoarder-web-1:/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs ./ollama.9c897541.cjs
Edit this file and add the following.
const { Agent } = require('undici');

// No-timeout fetch function
const noTimeoutFetch = (input, init = {}) => {
  return fetch(input, {
    ...init,
    dispatcher: new Agent({ headersTimeout: 3600000 }), // 1-hour timeout
  });
};
In the Ollama class constructor, replace

this.fetch = fetch;

with

this.fetch = noTimeoutFetch;
Then I edited my Docker Compose file, mounted this file as a volume into the web service, and recreated the container.
services:
  ...
  web:
    ...
    volumes:
      - /docker/appdata/hoarder/data:/data
      - ./ollama.9c897541.cjs:/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs
Basically just copied the solution someone linked above.
Can confirm that the changes @sbarbett mentioned do indeed resolve the issue.
Nevertheless, as far as I can see, this is a change in a library and not in hoarder itself - otherwise I would have done it in a PR. Unfortunately, I'm not deep enough into next.js to help with that. Sorry...
I agree that we shouldn't be hand-modifying the node modules in the way I described above. The issue needs to be addressed upstream, i.e. someone should open a discussion with undici or the Node.js team and ask for a way to make headersTimeout globally configurable (if it isn't already; I didn't dig too deep here and am just going off @PawelJ-PL's comment from the other thread).
An interim solution would be to make a custom wrapper. By default the timeout will be 5 minutes, unless someone adds an INFERENCE_FETCH_TIMEOUT_SEC variable to their environment.
Call this customFetch.js:
const { Agent } = require('undici');

// Default timeout of 5 minutes in milliseconds
const defaultTimeout = 5 * 60 * 1000;

// Custom fetch function that uses a configurable timeout
const noTimeoutFetch = (input, init = {}) => {
  // Get timeout from environment variable or use default
  const timeout = process.env.INFERENCE_FETCH_TIMEOUT_SEC
    ? parseInt(process.env.INFERENCE_FETCH_TIMEOUT_SEC, 10) * 1000
    : defaultTimeout;

  // Set up an Agent with the configured timeout
  return fetch(input, {
    ...init,
    dispatcher: new Agent({ headersTimeout: timeout }),
  });
};

module.exports = { noTimeoutFetch };
Then, wherever the Ollama client is instantiated in the Hoarder code, this needs to be imported and passed to the constructor.
const { noTimeoutFetch } = require('../utils/customFetch'); // Adjust path as necessary
// ...
const { Ollama } = require('ollama'); // the package is "ollama" (ollama@0.5.9 in the stack traces above)

const ollamaClient = new Ollama({
  host: appConfig.OLLAMA_BASE_URL,
  fetch: noTimeoutFetch, // Use custom fetch here
});
In the YAML we'd just add this to the environment like so.
services:
  web:
    environment:
      - INFERENCE_FETCH_TIMEOUT_SEC=3600 # 1 hour fetch timeout
Down the road, if they ever implement a global timeout, you'd just get rid of the custom wrapper and map the INFERENCE_FETCH_TIMEOUT_SEC environment variable directly.
I wrote a custom wrapper to fix this bug and submitted a pull request.
Describe the Bug
I am running AI tagging with Ollama and have set the timeout to INFERENCE_JOB_TIMEOUT_SEC=1200. This is not respected: if the query runs longer than 5 minutes, it just fails without any error message and the inference job runs again. I looked through the ollama bug tracker and similar issues have been reported there. They were all resolved with the message that it is the client closing the connection, and switching to a streamed response was suggested. I also checked the ollama source code for any timeouts and found that OLLAMA_KEEP_ALIVE has a default value of 5m. Changing this to a larger value has no effect on this issue.
Using curl against ollama confirms that queries running longer than 5 minutes can complete successfully.
Steps to Reproduce
Set INFERENCE_JOB_TIMEOUT_SEC to a value >5m and run an inference job longer than 5m with ollama.
Expected Behaviour
Inference completes successfully.
Screenshots or Additional Context
No response
Device Details
No response
Exact Hoarder Version
v18