home-assistant / core


Ollama context token size must be configurable #119946

Closed Rudd-O closed 1 week ago

Rudd-O commented 2 months ago

The problem

Most reasonably sized smart homes make the HA prompt sent to the Ollama API too large for the default Ollama context size. Without an easy way to change that, half or more of the prompt (in my experience) is ignored / chopped right down the middle.

https://community.home-assistant.io/t/local-ai-llm-on-home-assistant-yellow-with-llama-3-phi-3-gemma-2-and-tinyllama/722332/2 has a bit more context.

It should be possible, through the Configure button in the LLM's config entry, to increase the context size from the default 2048 to any value the user specifies (hopefully clamped to the maximum context size the model supports, which I believe is available via the API).

I verified that the exact same prompt, but with a context of 4096 tokens (through Open-WebUI, which does let the user change the API call parameters), works flawlessly with all the models that were failing before (3B, 8B and 14B parameters).
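
For illustration only (this is not something the current integration exposes): at the Ollama API level, the `/api/chat` endpoint accepts a per-request `options` object, and a `num_ctx` set there should override the model's default, which is roughly what a configurable option would need to send. The model name and prompt below are placeholders:

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Turn on the kitchen lights"}],
  "options": {"num_ctx": 4096}
}'
```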

What version of Home Assistant Core has the issue?

core-2024.05

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Core

Integration causing the issue

ollama

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

https://community.home-assistant.io/t/local-ai-llm-on-home-assistant-yellow-with-llama-3-phi-3-gemma-2-and-tinyllama/722332/2 has context on the troubles I had.

Rudd-O commented 1 month ago

Any activity on this? The latest release has no change that fixes the issue. The latest change was 4 days ago, but nobody has taken a look at this issue.

home-assistant[bot] commented 1 month ago

Hey there @synesthesiam, mind taking a look at this issue as it has been labeled with an integration (ollama) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `ollama` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign ollama` Removes the current integration label and assignees on the issue, add the integration domain after the command.
- `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
- `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


ollama documentation ollama source (message by IssueLinks)

tannisroot commented 1 month ago

The default 2048 context window size is quite detrimental to the quality of the responses. Home Assistant could improve local LLM performance by a lot by making it configurable and setting it to a high value for the model presets it provides.

Rudd-O commented 1 month ago

I'd be happy to have the context token size increased to 4096. If @synesthesiam is in favor of it, I am glad to make the change.

tannisroot commented 1 month ago

> I'd be happy to have the context token size increased to 4096. If @synesthesiam is in favor of it, I am glad to make the change.

Latest models such as llama3.1 can support context token size much larger than that

Rudd-O commented 1 month ago

The situation has worsened with the latest Ollama and HA: the system prompt is totally ignored (when using the default context token size) for even a reasonably sized smart home.

Anto79-ops commented 1 month ago

This could explain why the current version only works when you have less than 10 entities exposed, which is not very useful.

The model being used is advertised to support a 128k context size. I'm not saying it should be that big, but you can easily check the actual value in the Ollama logs with `journalctl -e -u ollama`.


Rudd-O commented 1 month ago

I have updated the fix PR so it merges cleanly against the latest HA. Can someone from HA please just review the PR?

Rudd-O commented 1 month ago

> This could explain why the current version only works when you have less than 10 entities exposed, which is not very useful.

That is in fact what is happening. I've tested it with Open-WebUI myself. Default context size? OWUI ignores my HASS-generated system prompt. A `num_ctx` of 4096 tokens? Works flawlessly.

Guess the devs didn't bother to test this integration with a real-life, complex smart home.

tannisroot commented 1 month ago

> This could explain why the current version only works when you have less than 10 entities exposed, which is not very useful.
>
> The model being used is advertised to support a 128k context size. I'm not saying it should be that big, but you can easily check the actual value in the Ollama logs with `journalctl -e -u ollama`.

This is what I personally get in the logs after interacting with the LLM through Home Assistant. Note the `n_ctx=2048`. And I don't even have that many devices in my smart home, only 25.

aug 08 06:59:00 arch ollama[807538]:   Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
aug 08 06:59:00 arch ollama[807538]: llm_load_tensors: ggml ctx size =    0.27 MiB
aug 08 06:59:02 arch ollama[807538]: llm_load_tensors: offloading 32 repeating layers to GPU
aug 08 06:59:02 arch ollama[807538]: llm_load_tensors: offloading non-repeating layers to GPU
aug 08 06:59:02 arch ollama[807538]: llm_load_tensors: offloaded 33/33 layers to GPU
aug 08 06:59:02 arch ollama[807538]: llm_load_tensors:      ROCm0 buffer size =  4156.00 MiB
aug 08 06:59:02 arch ollama[807538]: llm_load_tensors:        CPU buffer size =   281.81 MiB
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: n_ctx      = 8192
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: n_batch    = 512
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: n_ubatch   = 512
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: flash_attn = 0
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: freq_base  = 500000.0
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: freq_scale = 1
aug 08 06:59:02 arch ollama[807538]: llama_kv_cache_init:      ROCm0 KV buffer size =  1024.00 MiB
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model:  ROCm_Host  output buffer size =     2.02 MiB
aug 08 06:59:02 arch ollama[807695]: [1723089542] warming up the model with an empty run
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model:      ROCm0 compute buffer size =   560.00 MiB
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model:  ROCm_Host compute buffer size =    24.01 MiB
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: graph nodes  = 1030
aug 08 06:59:02 arch ollama[807538]: llama_new_context_with_model: graph splits = 2
aug 08 06:59:03 arch ollama[807695]: INFO [main] model loaded | tid="140169818156096" timestamp=1723089543
aug 08 06:59:03 arch ollama[807538]: time=2024-08-08T06:59:03.211+03:00 level=INFO source=server.go:623 msg="llama runner started in 4.01 seconds"
aug 08 06:59:04 arch ollama[807538]: [GIN] 2024/08/08 - 06:59:04 | 200 |  5.069797902s |     192.168.1.5 | POST     "/api/chat"
aug 08 06:59:05 arch ollama[807538]: [GIN] 2024/08/08 - 06:59:05 | 200 |  760.874061ms |     192.168.1.5 | POST     "/api/chat"
aug 08 07:00:01 arch ollama[807695]: INFO [update_slots] input truncated | n_ctx=2048 n_erase=1179 n_keep=4 n_left=2044 n_shift=1022 tid="140169818156096" timestamp=1723089601
aug 08 07:00:02 arch ollama[807538]: [GIN] 2024/08/08 - 07:00:02 | 200 |  779.494209ms |     192.168.1.5 | POST     "/api/chat"
aug 08 07:00:20 arch ollama[807695]: INFO [update_slots] input truncated | n_ctx=2048 n_erase=1179 n_keep=4 n_left=2044 n_shift=1022 tid="140169818156096" timestamp=1723089620
allenporter commented 4 weeks ago

I can try to re-run some evals playing with context size and see if that improves the major losses we see with larger homes. If limited context is responsible for the quality loss, I agree that is good to resolve, of course.

On /how/: given each Modelfile already specifies a context size, I'd like to understand whether Ollama should be setting this. The request parameters appear to default to 2048, which I think is a mistake, but maybe there is a good reason for it. I think the point was to increase it for images, but I'm not sure yet.

allenporter commented 4 weeks ago

Reading the Ollama Discord, folks are noticing that llama3.1 sets a 2048 context by default, which is far too low. I believe the reason it is set lower is that it can affect RAM usage.

allenporter commented 4 weeks ago

Someone anecdotally reported:

Llama3.1 (default 8B, Q4) used 6.7 GB with the default context.
- 4k used 8.3 GB
- 8k used 11 GB
- 32k used 31 GB
- 64k used 56 GB
- If you exceed the context size then you get crap out of your model, so make sure you have a large enough context window.

Anto79-ops commented 4 weeks ago

Would a user-configurable option be useful? I think the PR mentioned tries to address that.

EDIT: Oh, never mind, it seems y'all are working on that in the PR... great!

Will it be in the 2024.8.1 milestone?

allenporter commented 4 weeks ago

FWIW, I'm not seeing any memory increase when moving from 2048 to 8192 with llama3.1:8b.

I saw an improvement in the assist dataset from 45% to 65% for llama3.1 by moving from a 2048 to an 8192 context size using #121803, so that is very positive so far.

allenporter commented 4 weeks ago

By the way, you can see a log message like this with `OLLAMA_DEBUG=1` when the context window may be too small:

time=2024-08-08T15:11:05.380Z level=DEBUG source=prompt.go:51 msg="truncating input messages which exceed context length" truncated=2
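
For anyone running Ollama as a systemd service, a minimal sketch of one way to turn that debug logging on (this assumes the unit is named `ollama`, as in the logs above; adjust if yours differs):

```
# Add an environment override to the Ollama service unit
sudo systemctl edit ollama
# ...and in the override file add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"

# Restart the service and follow the logs for the truncation message
sudo systemctl restart ollama
journalctl -f -u ollama
```
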
tannisroot commented 2 weeks ago

I've found a temporary workaround for this issue until the context length is configurable through Home Assistant. In this example I will be using the llama3.1 model and setting the context length to 8192.

- Edit the model's modelfile (here `llama3.1.modelfile`) so that it sets the `num_ctx` parameter, with a line like this above the LICENSE block:

PARAMETER num_ctx 8192

LICENSE "LLAMA 3.1 COMMUNITY LICENSE AGREEMENT Llama 3.1 Version Release Date: July 23, 2024

- Save the edits to the file, then create a new model using this modelfile:
`ollama create llama3.1_8192 --file llama3.1.modelfile`
- After this the model will show up in the list in the Ollama integration setup.

Note that 8192 was the minimum value that would successfully work with 25 devices in my smart home (I've tested the default 2048, 4096 and 6144, and that was too little). I've also tried the maximum context size of 131072, and that requested 20.1 GiB of VRAM, far more than the 6 GB on my Intel A380 GPU.
You can experiment with different context length values to see how they affect memory usage; Ollama will print the memory required for the new context size in its INFO logs, like this: `memory.required.full="5.9 GiB"`
Also note that if VRAM is insufficient, Ollama will offload to system memory, which introduces a performance penalty.
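
For reference, a minimal end-to-end sketch of this workaround as shell commands, assuming the stock `llama3.1` model (the `llama3.1.modelfile` filename and the `llama3.1_8192` model name are just the examples used above):

```
# Dump the existing modelfile so it can be edited
ollama show llama3.1 --modelfile > llama3.1.modelfile

# Edit llama3.1.modelfile and add this line (e.g. above the LICENSE block):
#   PARAMETER num_ctx 8192

# Build a new model that bakes in the larger context window
ollama create llama3.1_8192 --file llama3.1.modelfile

# Optional: check that the new model loads, then select it in the Ollama integration
ollama run llama3.1_8192
```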