opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples, such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0
213 stars 131 forks

ChatQnA mega service failing on GCP #395

Closed: srinarayan-srikanthan closed this issue 1 month ago

srinarayan-srikanthan commented 1 month ago

To build the Docker images on GCP I had to add an extra `--network` flag to build the containers successfully. I am able to run all the microservices separately and am not facing any issues there.

But running the mega service gives an Internal Server Error, and the Docker logs show a JSON decode error.

I see all the containers up and running.

Sometimes in the Docker logs I see a broken-pipe error or `aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer`.

lvliang-intel commented 1 month ago

@srinarayan-srikanthan, did you validate each microservice individually?

srinarayan-srikanthan commented 1 month ago

Yes, I validated each microservice and all of them work fine.

eero-t commented 1 month ago

Yes, I validated each microservice and all of them work fine.

In the same setup where GCP is running? And did they return valid content rather than an error?

I'm asking because some of the backend services can take a long time to initialize, especially on the first run if the model data is not pre-provisioned. The backend services do respond early on, but until they're fully up the response is an error message, and the caller typically reports that as a JSON decode error because the reply lacks the content it expected.
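One way to make that failure mode visible is to decode the reply defensively and keep the raw body when it is not JSON. This is a minimal sketch, not the actual OPEA gateway code; the function name and the sample "loading" message are illustrative:

```python
import json

def parse_service_response(body: str) -> dict:
    """Decode a microservice reply. A backend that is still warming up
    may answer with a plain-text error instead of JSON, so preserve the
    raw body for diagnosis instead of surfacing a JSON decode error."""
    try:
        return {"ok": True, "data": json.loads(body)}
    except json.JSONDecodeError:
        return {"ok": False, "raw": body.strip()}

# A healthy reply parses; a warm-up error does not.
print(parse_service_response('{"text": "hello"}'))
print(parse_service_response("Model is currently loading"))
```

Logging the `raw` field on failure shows immediately which backend answered with a warm-up error instead of a real payload.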

srinarayan-srikanthan commented 1 month ago

Yes, the LLM service takes time to load on the first run, but all microservices run well. And I am doing all of this on the v0.7 tag.

eero-t commented 1 month ago

Could you attach the whole backtrace of the error? I'm hoping it shows the address of the service that returned the unrecognized data.

Is it possible that one of the services was migrated to another node, and it did not have the necessary data (yet)?

srinarayan-srikanthan commented 1 month ago

INFO:     10.128.0.15:45354 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 174, in __call__
    raise exc
  File "/usr/local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 172, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/user/GenAIComps/comps/cores/mega/gateway.py", line 122, in handle_request
    result_dict = await self.megaservice.schedule(initial_inputs={"text": prompt}, llm_parameters=parameters)
  File "/home/user/GenAIComps/comps/cores/mega/orchestrator.py", line 49, in schedule
    response, node = await done_task
  File "/home/user/GenAIComps/comps/cores/mega/orchestrator.py", line 104, in execute
    async with session.post(endpoint, json=inputs) as response:
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
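The reset occurs in the orchestrator's `session.post` call to a downstream service. Transient resets like this can be smoothed over with retries and exponential backoff around that call. A stdlib-only sketch under that assumption (the function name is hypothetical, not part of GenAIComps):

```python
import time

def call_with_retry(fn, attempts=4, base_delay=0.5):
    """Retry a flaky downstream call on connection resets or broken
    pipes, doubling the wait between attempts."""
    for i in range(attempts):
        try:
            return fn()
        except (ConnectionResetError, BrokenPipeError):
            if i == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** i)
```

Retrying only masks the symptom, of course; it is still worth finding out why the peer resets the connection at all.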

eero-t commented 1 month ago

Sometimes in the Docker logs I see a broken-pipe error or `aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer`.

I've just seen something similar after making a large number of requests. My current assumption is that once the input buffer of HuggingFace's TGI or TEI (ChatQnA's LLM backends) became full, they rejected further connections until inference for some of the already buffered requests had completed.

Could this (too large a number of parallel requests) also be a problem in your case?
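If it were a concurrency problem, one mitigation would be capping the number of in-flight requests on the caller side. A generic `asyncio.Semaphore` sketch (not the actual GenAIComps orchestrator code; `fake_request` stands in for an inference call):

```python
import asyncio

async def bounded_gather(coros, limit=4):
    """Run coroutines with at most `limit` in flight at once, so a
    TGI/TEI-style backend never sees more parallel requests than that."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def main():
    active = peak = 0

    async def fake_request(i):
        # Stand-in for one inference call; tracks peak concurrency.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return i

    results = await bounded_gather([fake_request(i) for i in range(16)], limit=4)
    return peak, results

if __name__ == "__main__":
    peak, _ = asyncio.run(main())
    print(f"peak in-flight requests: {peak}")  # never exceeds the limit of 4
```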

srinarayan-srikanthan commented 1 month ago

No, I am just sending one request, so I don't think that is the problem here.

srinarayan-srikanthan commented 1 month ago

I tried it on a bare-metal instance on a lab machine and it works fine; it only fails on GCP.

srinarayan-srikanthan commented 1 month ago

Isolated the issue to the core count allocated to the retrieval microservice. Working on determining the number of cores it needs and fixing it. The issue is not present with the latest code.

srinarayan-srikanthan commented 1 month ago

Resolved in the 0.8 release.