rashadphz / farfalle

🔍 AI search engine - self-host with local or cloud LLMs
https://www.farfalle.dev/
Apache License 2.0

500: Model is at capacity. Please try again later. #20

Closed · linge1688 closed this issue 1 month ago

arsaboo commented 1 month ago

Duplicate #18

rashadphz commented 1 month ago

Do you see "Ollama is running" when you visit http://localhost:11434/?
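If a browser isn't handy, a quick check from Python works too. This is just an illustrative sketch using httpx; the root endpoint should print the same "Ollama is running" message:

```
import httpx

# Ollama's root endpoint returns a plain-text health message.
print(httpx.get("http://localhost:11434/").text)  # expect: "Ollama is running"
```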

StargazerEcho commented 1 month ago

I've seen similar behavior today on 55a7b75 with local models. Ollama is running on Windows, and `ollama run` commands work as expected. I was able to replicate the error with all three supported local models.

With llama running locally, Farfalle would start to format the response body properly, and when it was almost done I'd get a 500 error. The exact error can differ from search to search:

500: Expecting ',' delimiter: line 1 column 5539 (char 5538)
500: Unterminated string starting at: line 1 column 6974 (char 6973)
500: Expecting ',' delimiter: line 1 column 4091 (char 4090)

No issues with cloud models.

I was seeing the 500 error before, but the only impact appeared to be that the "related" questions wouldn't appear below the main response body.

rashadphz commented 1 month ago

Thanks for the details, I'll check this out and get back to you

rashadphz commented 1 month ago

Do you have any example queries from when you got this error? I'm trying to reproduce it, but it's only happened once for me so far.

rashadphz commented 1 month ago

I just looked into it: the smaller models seem inconsistent at generating valid JSON. I simplified one of the prompts to make it easier for the smaller models. This error may still come up, but if it does, let me know what prompt you used.

StargazerEcho commented 1 month ago

Thanks. I'm avoiding 70B models as they don't fit in my GPU's memory. I've been using the sample prompts for basic sanity testing:

openai scarlett johansson?

Just tried again after pulling and rebuilding. Llama worked once. Gemma and Mistral 500'd. Llama throws a 500 on runs after the first.

I'd love to try phi3:14b or command-r if they're easy to add as options. My current model options:

```
NAME              ID    SIZE      MODIFIED
phi3:14b                7.9 GB    24 hours ago
command-r:latest        20 GB     2 days ago
gemma:latest            5.0 GB    3 days ago
gemma:7b                5.0 GB    3 days ago
mistral:latest          4.1 GB    3 days ago
llama3:latest           4.7 GB    3 days ago
```

Also, looks like others have seen similar issues with small models and JSON generation. Not sure if this helps provide ideas on resolution:

https://thoughtbot.com/blog/get-consistent-data-from-your-llm-with-json-schema
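For what it's worth, the core idea in that article can be sketched in a few lines: define a schema, validate whatever the model returns, and re-prompt on failure. This is only an illustration under my own assumptions (the `RelatedQuestions` schema and the `ask_llm` callable are hypothetical placeholders, not Farfalle's actual code):

```
import json
from pydantic import BaseModel, ValidationError

class RelatedQuestions(BaseModel):
    # Hypothetical schema standing in for whatever structured output is expected.
    questions: list[str]

def parse_with_retries(ask_llm, prompt: str, retries: int = 3) -> RelatedQuestions:
    """Ask the LLM for JSON and re-prompt until the reply validates against the schema."""
    last_error = None
    for _ in range(retries):
        raw = ask_llm(prompt)  # placeholder for the actual LLM call
        try:
            return RelatedQuestions.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as exc:
            last_error = exc
            prompt = f"{prompt}\n\nYour last reply was not valid JSON ({exc}). Reply with JSON only."
    raise last_error
```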

rashadphz commented 1 month ago

hey I just added phi3:14b, let me know if it works better for you

and thanks for the article! I'll see if there is anything from it that I can implement

Floyr commented 1 month ago

Hey, hello and thank you for the project! If I understand correctly, this error most often comes up when there isn't enough RAM. I added swap memory, made Redis work only on disk (that didn't help much), selected the gemma 2b model (it errors right away, even though it works fine on the Ollama server directly), and reduced the number of search results (that also didn't help much).

I also tried adding a timeout to the asynchronous calls (which didn't help, as I'm not very good at that). I based this on a problem I had in another project, PrivateGPT, where I solved it by simply adding a 600-second timeout. (There it also gave an error if the model didn't produce a response within 60 seconds.)

Please tell me, what do you think is causing this?

Oh, I forgot to add that for some reason the "gemma" model didn't work at all (not even one second of GPU load), while the other models, "llama3" and "Mistral", did show GPU load.

Update: Oh, I see you've added new features. Well, now I'm getting a 500 error right away. I've edited the .env file as advised, but so far it hasn't helped.
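On the timeout idea above: if I understand the stack correctly, llama-index's Ollama wrapper accepts a request_timeout, so the 600-second workaround from PrivateGPT can at least be tried there. A rough sketch only, not Farfalle's actual configuration, and the model name is just an example:

```
from llama_index.llms.ollama import Ollama

# Rough sketch: raise the request timeout so slow local generations
# aren't cut off mid-stream (600s mirrors the PrivateGPT workaround).
llm = Ollama(
    model="llama3",                      # example model name
    base_url="http://localhost:11434",
    request_timeout=600.0,
)
print(llm.complete("ping"))  # simple smoke test
```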

Floyr commented 1 month ago

> hey I just added phi3:14b, let me know if it works better for you
>
> and thanks for the article! I'll see if there is anything from it that I can implement

Hey, hello and thank you for the project! If I understand correctly, this error most often comes up when there isn't enough RAM. I added swap memory, made Redis work only on disk (that didn't help much), selected the gemma 2b model (it errors right away, even though it works fine on the Ollama server directly), and reduced the number of search results (that also didn't help much).

I also tried adding a timeout to the asynchronous calls (which didn't help, as I'm not very good at that). I based this on a problem I had in another project, PrivateGPT, where I solved it by simply adding a 600-second timeout. (There it also gave an error if the model didn't produce a response within 60 seconds.)

Please tell me, what do you think is causing this?

arsaboo commented 1 month ago

@rashadphz I received my first 500 error, and here's the stack trace:


```
Traceback (most recent call last):
  File "/workspace/src/backend/chat.py", line 111, in stream_qa_objects
    async for completion in response_gen:
  File "/workspace/.venv/lib/python3.11/site-packages/llama_index/core/llms/callbacks.py", line 280, in wrapped_gen
    async for x in f_return_val:
  File "/workspace/.venv/lib/python3.11/site-packages/llama_index/llms/ollama/base.py", line 408, in gen
    chunk = json.loads(line)
            ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2891 (char 2890)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/src/backend/main.py", line 97, in generator
    async for obj in stream_qa_objects(chat_request):
  File "/workspace/src/backend/chat.py", line 140, in stream_qa_objects
    raise HTTPException(status_code=500, detail=detail)
fastapi.exceptions.HTTPException: 500: Expecting value: line 1 column 2891 (char 2890)
```

arsaboo commented 1 month ago

Now, I must add that the same query worked the next time; it's just a function of the output produced by the LLM. But it would be great to handle such errors gracefully.
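Just to illustrate what "gracefully" could mean here (a sketch under my own assumptions, not the actual chat.py): skip or log a malformed streamed chunk instead of turning it into a 500 for the whole request.

```
import json
import logging

logger = logging.getLogger(__name__)

async def iter_parsed_chunks(lines):
    """Yield parsed JSON chunks from a stream, dropping malformed lines instead of failing."""
    async for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Log and skip the bad chunk rather than aborting the whole response.
            logger.warning("Dropping malformed chunk from the model: %s", exc)
```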

arsaboo commented 1 month ago

Here's another one:


```
Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 69, in map_httpcore_exceptions

d
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 373, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
    raise exc from None
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/http11.py", line 143, in handle_async_request
    raise exc
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/http11.py", line 113, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/http11.py", line 186, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_async/http11.py", line 224, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/workspace/.venv/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/workspace/src/backend/search/search_service.py", line 68, in perform_search
    results = await search_provider.search(query)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/backend/search/providers.py", line 45, in search
    link_results, image_results = await asyncio.gather(
                                  ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/backend/search/providers.py", line 55, in get_link_results
    response = await client.get(
               ^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1801, in get
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1574, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1661, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1763, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 372, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/workspace/.venv/lib/python3.11/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/src/backend/chat.py", line 84, in stream_qa_objects
    search_response = await perform_search(query)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/backend/search/search_service.py", line 75, in perform_search
    raise HTTPException(
fastapi.exceptions.HTTPException: 500: There was an error while searching.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/src/backend/main.py", line 97, in generator
    async for obj in stream_qa_objects(chat_request):
  File "/workspace/src/backend/chat.py", line 140, in stream_qa_objects
    raise HTTPException(status_code=500, detail=detail)
fastapi.exceptions.HTTPException: 500: 500: There was an error while searching.
```

arsaboo commented 1 month ago

Ok....looks like one of the errors may be related to Bing being down right now. Here are the logs from the searx container:


```
2024-05-23 12:25:51,262 WARNING:searx.engines.qwant: ErrorContext('searx/engines/qwant.py', 191, 'raise SearxEngineAPIException(f"{msg} ({error_code})")', 'searx.exceptions.SearxEngineAPIException', None, ('unknown (None)',)) False
2024-05-23 12:25:51,262 ERROR:searx.engines.qwant: exception : unknown (None)
Traceback (most recent call last):
  File "/usr/local/searxng/searx/search/processors/online.py", line 163, in search
    search_results = self._search_basic(query, params)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/search/processors/online.py", line 151, in _search_basic
    return self.engine.response(response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/engines/qwant.py", line 151, in response
    return parse_web_api(resp)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/engines/qwant.py", line 191, in parse_web_api
    raise SearxEngineAPIException(f"{msg} ({error_code})")
searx.exceptions.SearxEngineAPIException: unknown (None)
2024-05-23 12:25:52,409 WARNING:searx.engines.qwant images: ErrorContext('searx/engines/qwant.py', 191, 'raise SearxEngineAPIException(f"{msg} ({error_code})")', 'searx.exceptions.SearxEngineAPIException', None, ('unknown (None)',)) False
2024-05-23 12:25:52,409 ERROR:searx.engines.qwant images: exception : unknown (None)
Traceback (most recent call last):
  File "/usr/local/searxng/searx/search/processors/online.py", line 163, in search
    search_results = self._search_basic(query, params)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/search/processors/online.py", line 151, in _search_basic
    return self.engine.response(response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/engines/qwant.py", line 151, in response
    return parse_web_api(resp)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/searxng/searx/engines/qwant.py", line 191, in parse_web_api
    raise SearxEngineAPIException(f"{msg} ({error_code})")
searx.exceptions.SearxEngineAPIException: unknown (None)
2024-05-23 12:25:54,449 ERROR:searx.engines.bing images: engine timeout
2024-05-23 12:25:54,646 WARNING:searx.engines.bing images: ErrorContext('searx/search/processors/online.py', 116, "response = req(params['url'], **request_args)", 'httpx.TimeoutException', None, (None, None, None)) False
2024-05-23 12:25:54,646 ERROR:searx.engines.bing images: HTTP requests timeout (search duration : 4.2016162520158105 s, timeout: 4.0 s) : TimeoutException
```

StargazerEcho commented 1 month ago

Thanks, just tested phi3:14b. Getting a 500 error like with the other models. Also, it looks like a system prompt may be leaking into the reply: what appears to be a system prompt starts to show up right before the 500.

Text like this:

and, and only! IMARN/abstractorElsevier sentences in this format but do not-for asteen Web Page Scenario Summary on the most recent. The SATLuminary toonlines answer the given information from the following sentence;com. Do not using AI'20. answer only the complete your title, and letter YEAR and link for more than or you would be able people in anime Dec titled onwards with no more formal byline.

Then:

500: Expecting ',' delimiter: line 1 column 4091 (char 4090)

rashadphz commented 1 month ago

> hey I just added phi3:14b, let me know if it works better for you. and thanks for the article! I'll see if there is anything from it that I can implement
>
> Hey, hello and thank you for the project! If I understand correctly, this error most often comes up when there isn't enough RAM. I added swap memory, made Redis work only on disk (that didn't help much), selected the gemma 2b model (it errors right away, even though it works fine on the Ollama server directly), and reduced the number of search results (that also didn't help much).
>
> I also tried adding a timeout to the asynchronous calls (which didn't help, as I'm not very good at that). I based this on a problem I had in another project, PrivateGPT, where I solved it by simply adding a 600-second timeout. (There it also gave an error if the model didn't produce a response within 60 seconds.)
>
> Please tell me, what do you think is causing this?

Do the models work outside of Farfalle? When you run `ollama run phi3:14b` in your terminal, is it successful?

rashadphz commented 1 month ago

About the 500 errors: all seem related to forming structured outputs. I'm going to look into this more and see if I can find a fix.

arsaboo commented 1 month ago

No 500 errors today. Yesterday was a bad day with all the search engines down.

But we can add better error handling.

rashadphz commented 1 month ago

I just pushed a new change.

This should hopefully improve the 500 errors related to structured outputs (the `Expecting ',' delimiter: line 1 ...` errors).

Unfortunately, for the memory errors there isn't much that can be done. The smallest model Farfalle offers has 7B parameters; anything smaller doesn't produce good enough results.

arsaboo commented 1 month ago

Unfortunately, not much can be done if Ollama is not working properly.

StargazerEcho commented 1 month ago

Tried the latest; unfortunately I'm still getting a 500 that pops up when it's almost done creating the output:

500: Expecting value: line 1 column 4091 (char 4090)

arsaboo commented 1 month ago

What's the token generation rate of your Ollama instance? Wondering if that's causing the issues.

StargazerEcho commented 1 month ago

Does this help characterize it? Happy to run a different test.

`ollama run --verbose phi3:14b "Please tell me what the major features of MSSQL are and what the current version is"`

```
total duration:       12.4178991s
load duration:        1.0842ms
prompt eval count:    22 token(s)
prompt eval duration: 310.24ms
prompt eval rate:     70.91 tokens/s
eval count:           767 token(s)
eval duration:        12.105725s
eval rate:            63.36 tokens/s
```

rashadphz commented 1 month ago

> Tried the latest; unfortunately I'm still getting a 500 that pops up when it's almost done creating the output:
>
> 500: Expecting value: line 1 column 4091 (char 4090)

Does this happen while the answer is being generated or after? I'm trying to figure out whether the error is related to the follow-up questions or not. Unfortunately, I haven't been able to reproduce the error locally.

StargazerEcho commented 1 month ago

I believe it happens very near the end of generating the main response; I see almost all of the response generate before the 500. Is there a flag I can set, or a call I can comment out, to disable follow-up questions as a test?

I'm running ollama on Windows, for what it's worth, and farfalle is running via Docker Desktop on Windows.

Logs from backend:

```
INFO: 172.22.0.1:36506 - "POST /chat HTTP/1.1" 200 OK
2024-05-24 22:18:54 Traceback (most recent call last):
2024-05-24 22:18:54   File "/workspace/src/backend/chat.py", line 111, in stream_qa_objects
2024-05-24 22:18:54     async for completion in response_gen:
2024-05-24 22:18:54   File "/workspace/.venv/lib/python3.11/site-packages/llama_index/core/llms/callbacks.py", line 280, in wrapped_gen
2024-05-24 22:18:54     async for x in f_return_val:
2024-05-24 22:18:54   File "/workspace/.venv/lib/python3.11/site-packages/llama_index/llms/ollama/base.py", line 408, in gen
2024-05-24 22:18:54     chunk = json.loads(line)
2024-05-24 22:18:54             ^^^^^^^^^^^^^^^^
2024-05-24 22:18:54   File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
2024-05-24 22:18:54     return _default_decoder.decode(s)
2024-05-24 22:18:54            ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-24 22:18:54   File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
2024-05-24 22:18:54     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2024-05-24 22:18:54                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-24 22:18:54   File "/usr/local/lib/python3.11/json/decoder.py", line 353, in raw_decode
2024-05-24 22:18:54     obj, end = self.scan_once(s, idx)
2024-05-24 22:18:54                ^^^^^^^^^^^^^^^^^^^^^^
2024-05-24 22:18:54 json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 4091 (char 4090)
2024-05-24 22:18:54
2024-05-24 22:18:54 During handling of the above exception, another exception occurred:
2024-05-24 22:18:54
2024-05-24 22:18:54 Traceback (most recent call last):
2024-05-24 22:18:54   File "/workspace/src/backend/main.py", line 97, in generator
2024-05-24 22:18:54     async for obj in stream_qa_objects(chat_request):
2024-05-24 22:18:54   File "/workspace/src/backend/chat.py", line 140, in stream_qa_objects
2024-05-24 22:18:54     raise HTTPException(status_code=500, detail=detail)
2024-05-24 22:18:54 fastapi.exceptions.HTTPException: 500: Expecting ',' delimiter: line 1 column 4091 (char 4090)
```

arsaboo commented 1 month ago

Anything in docker logs?

StargazerEcho commented 1 month ago

just added it to my previous comment before I saw your response

rashadphz commented 1 month ago

It looks like this error is related to Ollama. The Ollama endpoint streams back JSON in this format for each predicted token:

{"model":"gemma:7b","created_at":"2024-05-25T02:37:36.646433Z","response":"What","done":false}

Based on your trace, your Ollama streams something that is not JSON.

It's hard to tell what the root cause is, but I'd guess it's related to memory issues. It might be helpful to check your Ollama logs as well.

Instructions for checking Ollama logs: https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md
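If anyone wants to confirm this outside of Farfalle, here's a small standalone sketch that streams from Ollama's /api/generate endpoint and flags any line that isn't valid JSON (httpx only because it's already in the stack; the model and prompt are arbitrary examples):

```
import json
import httpx

def probe_ollama_stream(model: str = "gemma:7b", prompt: str = "Say hello") -> None:
    """Stream /api/generate and report any line Ollama sends that is not valid JSON."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    with httpx.stream("POST", "http://localhost:11434/api/generate", json=payload, timeout=600) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            try:
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
            except json.JSONDecodeError:
                print(f"\n[non-JSON line from Ollama]: {line!r}")

if __name__ == "__main__":
    probe_ollama_stream()
```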

rashadphz commented 1 month ago

> Update: Oh, I see you've added new features. Well, now I'm getting a 500 error right away. I've edited the .env file as advised, but so far it hasn't helped.

I recently added support for SearXNG. This makes the docker-compose take up more memory, and it might be why the problem worsened for you. I'll add another docker-compose file that doesn't start up a SearxNG container.

rashadphz commented 1 month ago

Just pushed a docker-compose that should take less memory. This probably won't fix your problem, but let me know if it changes anything.

Run: `docker-compose -f docker-compose-no-searxng.yaml up --remove-orphans`

rashadphz commented 1 month ago

Closing this out for now, let me know if you're still having problems.

HakaishinShwet commented 3 weeks ago

@rashadphz I'm facing these 500 issues continuously. Text gets generated, but midway through or at the end the whole generated text is replaced with `500: Unterminated string starting at: line 1 column 82 (char 81)` or something similar :-((( Tested phi3 medium and llama 3 8b with Tavily and SearXNG.