petrm opened 2 weeks ago
Expanding on my investigation: the hoarder INFERENCE_JOB_TIMEOUT_SEC variable sets the keep_alive value in the request, which explicitly says:
keep_alive <string | number>: (Optional) How long to keep the model loaded.
So in fact this does the same as OLLAMA_KEEP_ALIVE, but at the request level.
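For reference, this is roughly what a request-level keep_alive looks like with ollama-js (host, model, and prompt here are placeholders, not hoarder's actual values):

const { Ollama } = require('ollama');

const ollama = new Ollama({ host: 'http://ollama:11434' });

async function main() {
  const res = await ollama.chat({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'hello' }],
    // Per-request equivalent of OLLAMA_KEEP_ALIVE: how long the model stays
    // loaded after this request, not how long the request is allowed to run.
    keep_alive: '20m',
  });
  console.log(res.message.content);
}

main();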
Looking at the ollama-js library, it seems to be using streaming, so I am not sure where the timeout comes from :( Sometimes I also get an error like this, which I have seen in one of the first hoarder bugs around the ollama integration.
ollama | 2024-11-07T20:24:47.289037279Z [GIN] 2024/11/07 - 20:24:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:24:47.310757138Z 2024-11-07T20:24:47.310Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:29:47.463722205Z [GIN] 2024/11/07 - 20:29:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:29:47.499741771Z 2024-11-07T20:29:47.497Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:34:47.667035483Z [GIN] 2024/11/07 - 20:34:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:34:47.685824389Z 2024-11-07T20:34:47.685Z info: [inference][12794] Starting an inference job for bookmark with id "n19qlnmsqhlf4emhdnnd3kg4"
ollama | 2024-11-07T20:39:47.820565942Z [GIN] 2024/11/07 - 20:39:47 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-07T20:39:47.820825324Z 2024-11-07T20:39:47.820Z error: [inference][12794] inference job failed: TypeError: fetch failed
hoarder | 2024-11-07T20:39:47.820845081Z TypeError: fetch failed
hoarder | 2024-11-07T20:39:47.820849760Z at node:internal/deps/undici/undici:13185:13
hoarder | 2024-11-07T20:39:47.820853838Z at async post (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:114:20)
hoarder | 2024-11-07T20:39:47.820857765Z at async Ollama.processStreamableRequest (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:232:25)
hoarder | 2024-11-07T20:39:47.820861903Z at async OllamaInferenceClient.runModel (/app/apps/workers/inference.ts:2:3086)
hoarder | 2024-11-07T20:39:47.820876501Z at async OllamaInferenceClient.inferFromText (/app/apps/workers/inference.ts:2:3790)
hoarder | 2024-11-07T20:39:47.820879146Z at async inferTagsFromText (/app/apps/workers/openaiWorker.ts:6:2663)
hoarder | 2024-11-07T20:39:47.820881630Z at async inferTags (/app/apps/workers/openaiWorker.ts:6:2886)
hoarder | 2024-11-07T20:39:47.820884175Z at async Object.runOpenAI [as run] (/app/apps/workers/openaiWorker.ts:6:6316)
hoarder | 2024-11-07T20:39:47.820886810Z at async Runner.runOnce (/app/apps/workers/node_modules/.pnpm/@hoarder+queue@file+packages+queue/node_modules/@hoarder/queue/runner.ts:2:2567)
Ok, that's weird. If your job timeout is 20 mins, you shouldn't see inference jobs starting at 5 min intervals. Also, you should have seen at least some errors between the runs. Let's start with the obvious question: did you run 'docker compose up' after updating the env file?
Yeah, ollama is up and reachable. I can see the behavior from my comment above with some of the bookmarks: hoarder tries to run inference on the same bookmark four times, and I always see a 5m POST in the ollama log. On the last attempt the exception appears.
hoarder | 2024-11-08T17:31:38.931997318Z 2024-11-08T17:31:38.931Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:36:39.111826101Z 2024-11-08T17:36:39.111Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:41:39.364475327Z 2024-11-08T17:41:39.363Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:46:39.578103571Z 2024-11-08T17:46:39.577Z info: [inference][12868] Starting an inference job for bookmark with id "o9101ikzegu3vqphxo7ahfbd"
hoarder | 2024-11-08T17:51:39.773566800Z 2024-11-08T17:51:39.773Z error: [inference][12868] inference job failed: TypeError: fetch failed
hoarder | 2024-11-08T17:51:39.773600023Z TypeError: fetch failed
hoarder | 2024-11-08T17:51:39.773610232Z at node:internal/deps/undici/undici:13185:13
hoarder | 2024-11-08T17:51:39.773618879Z at async post (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:114:20)
hoarder | 2024-11-08T17:51:39.773628357Z at async Ollama.processStreamableRequest (/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs:232:25)
hoarder | 2024-11-08T17:51:39.773637484Z at async OllamaInferenceClient.runModel (/app/apps/workers/inference.ts:2:3086)
hoarder | 2024-11-08T17:51:39.773645780Z at async OllamaInferenceClient.inferFromText (/app/apps/workers/inference.ts:2:3790)
hoarder | 2024-11-08T17:51:39.773654115Z at async inferTagsFromText (/app/apps/workers/openaiWorker.ts:6:2663)
hoarder | 2024-11-08T17:51:39.773662692Z at async inferTags (/app/apps/workers/openaiWorker.ts:6:2886)
hoarder | 2024-11-08T17:51:39.773671088Z at async Object.runOpenAI [as run] (/app/apps/workers/openaiWorker.ts:6:6316)
hoarder | 2024-11-08T17:51:39.773679774Z at async Runner.runOnce (/app/apps/workers/node_modules/.pnpm/@hoarder+queue@file+packages+queue/node_modules/@hoarder/queue/runner.ts:2:2567)
ollama | 2024-11-08T17:31:38.907763254Z [GIN] 2024/11/08 - 17:31:38 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:36:39.093509907Z [GIN] 2024/11/08 - 17:36:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:41:39.326055551Z [GIN] 2024/11/08 - 17:41:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:46:39.555181745Z [GIN] 2024/11/08 - 17:46:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
ollama | 2024-11-08T17:51:39.773308500Z [GIN] 2024/11/08 - 17:51:39 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
I also tried increasing the hoarder timeout to a large value (1h), just to rule out something funny like the 20m timeout being an aggregation of 4×5min.
This is getting weirder every time I look at it. I have been going through the logs and I also see cases like this:
hoarder | 2024-11-09T07:54:05.516343115Z 2024-11-09T07:54:05.516Z info: [inference][12916] Starting an inference job for bookmark with id "p5mlkh78gd9s0xux6vfk2qe4"
ollama | 2024-11-09T07:59:05.698825391Z [GIN] 2024/11/09 - 07:59:05 | 200 | 5m0s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-09T07:59:05.722073635Z 2024-11-09T07:59:05.721Z info: [inference][12916] Starting an inference job for bookmark with id "p5mlkh78gd9s0xux6vfk2qe4"
ollama | 2024-11-09T08:00:36.786488401Z [GIN] 2024/11/09 - 08:00:36 | 200 | 1m31s | 172.22.0.39 | POST "/api/chat"
hoarder | 2024-11-09T08:00:36.787668548Z 2024-11-09T08:00:36.787Z info: [inference][12916] Inferring tag for bookmark "p5mlkh78gd9s0xux6vfk2qe4" used 1662 tokens and inferred: database,open source,data management,postgresql,software engineering,web application
hoarder | 2024-11-09T08:00:36.820932932Z 2024-11-09T08:00:36.820Z info: [inference][12916] Completed successfully
This would support my earlier idea that there is some sort of aggregated timeout.
Is there any chance this could have something to do with the retries? https://github.com/hoarder-app/hoarder/blob/8f44c81b52c8ef5999929c225b3f883d00c157ec/packages/shared/queues.ts#L44 I still can't figure out where the 5 minutes are coming from, though.
I am not sure what you mean by "this". What I know is that the logging is currently VERY confusing, since timeouts/retries of jobs do not log anything. The error handler for jobs is only invoked once the job has completely failed (after all the retries), and only then does it log the actual problem of the task. In between you simply see that the task was started again (without any info that the previous execution failed or why it failed).
I am thinking it might be this: https://github.com/ollama/ollama-js/issues/103#issuecomment-2227531780
The comment suggests that it has nothing to do with anything in ollama or hoarder, but with the default timeout for fetch requests on nodejs, which seems to be 5 minutes.
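If that is the culprit, it should be reproducible outside of hoarder with nothing but Node's built-in fetch against an endpoint that takes more than 5 minutes to start answering. A minimal sketch, assuming Node 18+ and a placeholder URL/body (the error shape matches the stack traces above, with the underlying cause reported by undici):

// Built-in fetch is backed by undici, whose default headersTimeout
// is 300000 ms (5 minutes).
async function repro() {
  try {
    await fetch('http://ollama:11434/api/chat', {
      method: 'POST',
      body: JSON.stringify({
        model: 'llama3.1',
        messages: [{ role: 'user', content: '...' }],
        stream: false,
      }),
    });
  } catch (err) {
    // Surfaces as "TypeError: fetch failed"; err.cause carries the
    // underlying timeout error from undici.
    console.error(err, err.cause);
  }
}

repro();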
@kamtschatka
the logging is currently VERY confusing
Yeah, I agree. I'm also as surprised as you are that I ended up implementing it that way :D I can fix that.
fetch requests on nodejs, which seems to be 5 minutes.
That indeed looks very likely to be the case.
I am not sure what you mean by "this". What I know is that the logging is currently VERY confusing, since timeouts/retries of jobs do not log anything. The error handler for jobs is only invoked once the job has completely failed (after all the retries), and only then does it log the actual problem of the task. In between you simply see that the task was started again (without any info that the previous execution failed or why it failed).
Sorry, the context was just in my brain and not in the comment. You described pretty much what I meant: that example shows a successful inference, and yet there is the 5m timeout. I can't explain why the second request is so short and returns correctly.
Your deduction that the nodejs fetch timeout is 5 minutes is what I have been looking for. One thing does not add up for me though. What kind of timeout is it? If it were a connection timeout (like curl's), it would only trigger if the server did not respond within 5 minutes. It looks more like a maximum request duration, but I would expect that not to apply to a streaming response.
but I would expect that not to apply to a streaming response.
@petrm Maybe the time to first token is higher than 5 minutes?
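For what it's worth, undici has two separate timeouts in play here, and as far as I can tell both default to 5 minutes, so either slow response headers or a long stall between streamed chunks could trip it. The sketch below just spells out those defaults on an explicit Agent (the values are the documented defaults, not a recommendation):

const { Agent } = require('undici');

const agent = new Agent({
  // Time allowed until the response *headers* have been received.
  headersTimeout: 300000, // default: 5 minutes
  // Idle time allowed between body chunks once the response is streaming.
  bodyTimeout: 300000, // default: 5 minutes
});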
I am thinking it might be this: ollama/ollama-js#103 (comment)
The comment suggests that it has nothing to do with anything in ollama or hoarder, but with the default timeout for fetch requests on nodejs, which seems to be 5 minutes.
I think that issue is mixing different timeouts, and other issues about a 5m timeout are being referenced as well. In one case it is a long model loading time - that is not my case, loading is super fast for me. The other one is the nodejs request duration.
My ollama actually responds with 200 at the 5m mark and not with a 500 like in the referenced issue.
I have the same problem and saw a 500 at the 5-minute mark, but sometimes also a 200; the 200s were far more frequent than the 500s.
Ok, multiple people are hitting this now, so it's clearly a bug. I'm convinced it's probably the default fetch timeout thing.
It is the fetch timeout. I "patched" it by doing the following:
Copy the ollama node module off the container.
sudo docker cp hoarder-web-1:/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs ./ollama.9c897541.cjs
Edit this file and add the following.
const { Agent } = require('undici');

// No-timeout fetch function
const noTimeoutFetch = (input, init = {}) => {
  return fetch(input, {
    ...init,
    dispatcher: new Agent({ headersTimeout: 3600000 }), // 1-hour timeout
  });
};
In the Ollama class constructor, replace

this.fetch = fetch;

with

this.fetch = noTimeoutFetch;
Then I edited my Docker Compose file, mounted this file as a volume into the web service, and recreated the container.
services:
  ...
  web:
    ...
    volumes:
      - /docker/appdata/hoarder/data:/data
      - ./ollama.9c897541.cjs:/app/apps/workers/node_modules/.pnpm/ollama@0.5.9/node_modules/ollama/dist/shared/ollama.9c897541.cjs
Basically just copied the solution someone linked above.
Can confirm that the changes @sbarbett mentioned do indeed resolve the issue.
Nevertheless, as far as I can see, this is a change in a library and not in hoarder itself - otherwise I would have done it in a PR. Unfortunately, I'm not deep enough into next.js to help with that. Sorry...
I agree that we shouldn't be hand-modifying the node modules in the way I described above. The issue needs to be addressed upstream, i.e. someone should open a discussion with undici or the Node.js team and ask for a way to make headersTimeout globally configurable (if it isn't already; I didn't dig too deep here and am just going off @PawelJ-PL's comment from the other thread).
An interim solution would be to make a custom wrapper. By default the timeout will be 5 minutes, unless someone adds an INFERENCE_FETCH_TIMEOUT_SEC variable to their environment.
Call this customFetch.js:
const { Agent } = require('undici');

// Default timeout of 5 minutes in milliseconds
const defaultTimeout = 5 * 60 * 1000;

// Custom fetch function that uses a configurable timeout
const noTimeoutFetch = (input, init = {}) => {
  // Get timeout from environment variable or use default
  const timeout = process.env.INFERENCE_FETCH_TIMEOUT_SEC
    ? parseInt(process.env.INFERENCE_FETCH_TIMEOUT_SEC, 10) * 1000
    : defaultTimeout;

  // Set up an Agent with the configured timeout
  return fetch(input, {
    ...init,
    dispatcher: new Agent({ headersTimeout: timeout }),
  });
};

module.exports = { noTimeoutFetch };
Then, wherever the Ollama client is instantiated in the Hoarder code, this needs to be imported and passed to the constructor.
const { noTimeoutFetch } = require('../utils/customFetch'); // Adjust path as necessary
// ...
const { Ollama } = require('ollama'); // the package is "ollama" (ollama@0.5.9 in the stack traces above)

const ollamaClient = new Ollama({
  host: appConfig.OLLAMA_BASE_URL,
  fetch: noTimeoutFetch, // Use custom fetch here
});
In the YAML we'd just add this to the environment like so.
services:
  web:
    environment:
      - INFERENCE_FETCH_TIMEOUT_SEC=3600 # 1 hour fetch timeout
Down the road, if they ever implement a global timeout, you'd just get rid of the custom wrapper and map the INFERENCE_FETCH_TIMEOUT_SEC environment variable directly.
I wrote a custom wrapper to fix this bug and submitted a pull request.
Describe the Bug
I am running AI tagging with Ollama and have set the timeout to INFERENCE_JOB_TIMEOUT_SEC=1200. This is not respected: if the query runs longer than 5 minutes, it just fails without any error message and the inference job runs again. I looked through the ollama bug tracker and similar issues have been reported there. They were all resolved with the message that it is the client closing the connection, and switching to a streamed response was suggested. I also checked the ollama source code for any timeouts and found that OLLAMA_KEEP_ALIVE has a default value of 5m. Changing this to a larger value has no effect on this issue.
Using curl against ollama confirms that queries running longer than 5 minutes can complete successfully.
Steps to Reproduce
Set INFERENCE_JOB_TIMEOUT_SEC to a value >5m and run an inference job longer than 5m with ollama.
Expected Behaviour
Inference completes successfully.
Screenshots or Additional Context
No response
Device Details
No response
Exact Hoarder Version
v18