Closed ImVexed closed 7 months ago
Hi :). I assume you're using the default 2k context window in open-webui? Until today, my project used a much larger context window where possible (as in the case of command-r). I just pushed an update containing a new settings window, which lets you adjust the context window. Please confirm whether this causes the increase in VRAM usage / decrease in offloaded layers.
If that's the case, I assume Ollama simply ran out of memory on your system?
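For reference, Ollama accepts a per-request context window via the `options.num_ctx` field of `/api/chat`. A minimal sketch of building such a request body — the model name and the example values here are assumptions, not taken from the project's actual code:

```python
import json


def build_chat_payload(model: str, messages: list, num_ctx: int = 2048) -> dict:
    """Build an Ollama /api/chat request body with an explicit context window.

    Larger num_ctx values increase VRAM usage, which can force Ollama to
    offload fewer layers to the GPU.
    """
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


payload = build_chat_payload(
    "command-r",  # assumed model name
    [{"role": "user", "content": "hello"}],
    num_ctx=8192,  # hypothetical larger window; needs correspondingly more VRAM
)
print(json.dumps(payload, indent=2))
```

Sending this body to `POST http://ollama:11434/api/chat` would then use the requested window instead of the default.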
Yes, it's certainly quicker when I lower the context window size, though it seems to be breaking. It froze for maybe a minute or so while trying to pull info from the internet here:
ollama | [GIN] 2024/04/14 - 00:29:19 | 200 | 3.916807898s | 172.30.0.3 | POST "/api/chat"
searxng-1 | 2024-04-14 00:29:19,854 WARNING:searx.engines.google: ErrorContext('searx/search/processors/online.py', 116, "response = req(params['url'], **request_args)", 'searx.exceptions.SearxEngineTooManyRequestsException', None, ('Too many request',)) False
searxng-1 | 2024-04-14 00:29:19,854 ERROR:searx.engines.google: Too many requests
searxng-1 | Traceback (most recent call last):
searxng-1 | File "/usr/local/searxng/searx/search/processors/online.py", line 163, in search
searxng-1 | search_results = self._search_basic(query, params)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/search/processors/online.py", line 147, in _search_basic
searxng-1 | response = self._send_http_request(params)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/search/processors/online.py", line 116, in _send_http_request
searxng-1 | response = req(params['url'], **request_args)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/network/__init__.py", line 164, in get
searxng-1 | return request('get', url, **kwargs)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/network/__init__.py", line 95, in request
searxng-1 | return future.result(timeout)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
searxng-1 | return self.__get_result()
searxng-1 | ^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
searxng-1 | raise self._exception
searxng-1 | File "/usr/local/searxng/searx/network/network.py", line 289, in request
searxng-1 | return await self.call_client(False, method, url, **kwargs)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/network/network.py", line 272, in call_client
searxng-1 | return Network.patch_response(response, do_raise_for_httperror)
searxng-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
searxng-1 | File "/usr/local/searxng/searx/network/network.py", line 245, in patch_response
searxng-1 | raise_for_httperror(response)
searxng-1 | File "/usr/local/searxng/searx/network/raise_for_httperror.py", line 76, in raise_for_httperror
searxng-1 | raise SearxEngineTooManyRequestsException()
searxng-1 | searx.exceptions.SearxEngineTooManyRequestsException: Too many request, suspended_time=3600
searxng-1 | 2024-04-14 00:29:22,329 ERROR:searx.engines.duckduckgo: engine timeout
searxng-1 | 2024-04-14 00:29:22,423 WARNING:searx.engines.duckduckgo: ErrorContext('searx/engines/duckduckgo.py', 118, 'res = get(query_url)', 'httpx.ConnectTimeout', None, (None, None, 'duckduckgo.com')) False
searxng-1 | 2024-04-14 00:29:22,423 ERROR:searx.engines.duckduckgo: HTTP requests timeout (search duration : 3.0941880460013635 s, timeout: 3.0 s) : ConnectTimeout
backend-1 | 2024/04/14 00:29:22 WARN Error downloading website error="no content found"
After that it went through, but then got stuck in a loop:
Here are the full logs: temp.log
I'm pretty sure that it ran out of context. 2k tokens isn't much. You can see an estimate of the current context in the backend logs. I assume the format instructions aren't in the context anymore at this point, which results in the LLM ignoring the requested structure.
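As a rough illustration of how quickly a 2k-token window fills up, here is a sketch using the common 4-characters-per-token heuristic — the ratio is an approximation, not the model's real tokenizer, and the function names are made up for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)


def fits_in_context(history: list, num_ctx: int = 2048) -> bool:
    """Check whether the combined chat history fits in the context window.

    Once the total exceeds num_ctx, the oldest content (often the system
    prompt with the format instructions) is effectively pushed out.
    """
    return sum(estimate_tokens(m) for m in history) <= num_ctx


# A single scraped web page of ~10,000 characters already blows past 2k tokens:
page = "x" * 10_000
print(estimate_tokens(page), fits_in_context([page], num_ctx=2048))
```

Under this heuristic, one medium-sized search result alone can exceed the default window, which matches the behavior above.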
Closing in favor of #91.
Describe the bug
When submitting a question:
Exiting chain with error: Post "http://ollama:11434/api/chat": EOF
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Not to break.
Additional context
I think this may be some sort of timeout issue? I note it only happens to me with Command-R. I can use Command-R (18.8 GB) fine in Ollama's web UI, with 39/41 layers offloaded to the GPU (3090, 24 GB). But when I use LLocalSearch, I only see 19/41 layers offloaded. Not sure if that has anything to do with it, but it confuses me, since when I use Mixtral-8x7B (19 GB) it loads all layers to the GPU and has no issues with LLocalSearch.