opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples, such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

Exceptions in ChatQnA service logs when it gets a large number of requests #469

Open eero-t opened 1 month ago

eero-t commented 1 month ago

Setup

These errors happen with the v0.7 ChatQnA Xeon installation [1]; e.g. updating the TEI services from version 1.2-cpu to the latest 1.5-cpu, and the TGI service from version 1.4 to the latest 2.2, did not help.

[1] https://github.com/opea-project/GenAIExamples/tree/v0.7/ChatQnA/kubernetes/manifests

Use-case

Constantly stress the ChatQnA chaqna-xeon-backend-server-svc service endpoint by sending it a large[2] number of queries in parallel.

[2] Large compared to the actual capacity of the service, e.g. 8 queries in parallel for a service running on an Ice Lake Xeon.
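
For reference, a minimal sketch of that kind of stress run (the endpoint URL and payload are assumptions, adjust to the actual deployment; loop it for sustained stress). It streams each reply and counts chunks, so a failure before the first token is distinguishable from one mid-reply:

```python
import concurrent.futures

import requests

URL = "http://localhost:8888/v1/chatqna"  # assumed endpoint, adjust as needed
PAYLOAD = {"messages": "What is deep learning?"}  # assumed request shape
PARALLEL = 8  # "large" relative to one Ice Lake Xeon node


def one_query(i: int) -> str:
    chunks = 0
    try:
        with requests.post(URL, json=PAYLOAD, stream=True, timeout=300) as r:
            r.raise_for_status()
            # Consume the streamed reply chunk by chunk as it arrives.
            for _ in r.iter_content(chunk_size=None):
                chunks += 1
        return f"query {i}: ok, {chunks} chunks"
    except requests.RequestException as e:
        # An "unexpected EOF" surfaces here, with chunks showing how far it got.
        return f"query {i}: failed after {chunks} chunks: {e}"


with concurrent.futures.ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    for result in pool.map(one_query, range(PARALLEL)):
        print(result)
```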

Actual outcome

Occasionally, requests fail with exceptions in the service logs. For the exception details, see the attachments.

Logs for the embedding-svc, redis-vector-db, retriever-svc, tei-embedding-svc, tei-reranking-svc and tgi-svc pods did not show any exceptions or other errors.

The unexpected EOF error can happen before chaqna-xeon-backend-server-svc replies with the first token, or after it has already provided 100-200 tokens of the reply.

Expected outcome

Note: Rate-limiting helps when service scale-up is slow (a TGI pod may take minutes from startup until it is ready to respond), and once the service has been scaled up as far as it can go. But it needs to be done so that it does not prevent scale-up, or cause too much fluctuation in it.

(No comment on whether that should be implemented in chaqna-xeon-backend-server-svc itself, or in some load-balancer in front of it.)
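
As an illustration only (nothing here is from the repo, all names and numbers are made up): one shape such limiting could take is to cap in-flight requests per ready replica and queue the excess rather than reject it, so bursts do not turn into mid-stream EOFs while the queued load still gives an autoscaler a signal to scale up on:

```python
import asyncio

MAX_INFLIGHT_PER_REPLICA = 8  # roughly one Xeon node's capacity, per above


def make_gate(ready_replicas: int) -> asyncio.Semaphore:
    # Static in this sketch; a real limiter would track readiness changes
    # so scale-up actually raises the cap instead of being prevented.
    return asyncio.Semaphore(MAX_INFLIGHT_PER_REPLICA * ready_replicas)


async def limited_call(gate: asyncio.Semaphore, handler, request):
    # Waiting here queues excess load; adding a wait timeout would turn
    # overload into a clean 429/503 instead of a dropped connection.
    async with gate:
        return await handler(request)
```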

eero-t commented 1 month ago

FYI: I have (k8s) readiness probes on the TGI and TEI services, because otherwise things fail due to the (k8s) service endpoint sending traffic to recently scaled-up TGI instances that are not yet ready to accept requests.

However, sometimes TGI goes to a non-ready state some time later (e.g. when there are multiple instances on the same Xeon node and all of them are stressed at the same time). There are no warnings in the logs, no crashes; TGI just stops responding, k8s marks the instances as non-ready, and does not route traffic to them (for a while). This is likely the cause of at least some of those exceptions in the services using TGI.
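
A rough picture of what the readiness probe effectively does here (a sketch, not the actual manifest): poll TGI's health endpoint and keep the instance out of the endpoint set while it fails. This assumes TGI's GET /health returning 200 once the model is loaded; the in-cluster URL is illustrative.

```python
import time

import requests

TGI_HEALTH_URL = "http://tgi-svc:80/health"  # illustrative address


def is_ready(timeout: float = 2.0) -> bool:
    try:
        return requests.get(TGI_HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


while True:
    # When this flips to False under load (as described above), k8s stops
    # routing new requests to the instance until it recovers.
    print("ready" if is_ready() else "not ready")
    time.sleep(10)  # analogous to the probe's periodSeconds
```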

YuningQiu commented 1 month ago

Hello, thanks for bringing up this issue. The latest OPEA v0.8 has been released, and the latest version of ChatQnA uses TEI 1.5-cpu and TGI v2.1.0.

Could you please give v0.8 ChatQnA a try and let us know if this issue still occurs? Thanks a lot!