eero-t opened 1 month ago
FYI: I have (k8s) readiness probes on the TGI and TEI services, because otherwise things fail due to the (k8s) service endpoint sending traffic to recently scaled-up TGI instances before TGI is ready to accept requests.
However, sometimes TGI goes to a non-ready state some time later (e.g. when there are multiple instances on the same Xeon node and all of them are stressed at the same time). There are no warnings in the logs and no crashes. TGI just stops responding, k8s marks the instances as non-ready, and does not route traffic to them (for a while). This is likely the cause of at least some of the exceptions in the services using TGI.
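For reference, a minimal sketch of such a readiness probe, assuming the TGI container serves HTTP on port 80 and using its `/health` route; the timing values are illustrative assumptions, not tested recommendations:

```yaml
# Excerpt from a TGI Deployment spec; values are illustrative.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:2.2.0
    ports:
      - containerPort: 80
    readinessProbe:
      httpGet:
        path: /health    # returns 200 once the model is loaded
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 10    # re-checked periodically, so a stressed
      timeoutSeconds: 5    # instance can also drop out of rotation later
      failureThreshold: 3
```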
Setup
These errors happen with a v0.7 ChatQnA Xeon installation [1], and e.g. updating the TEI services from the `1.2-cpu` version to the latest `1.5-cpu`, and the TGI service from `1.4` to the latest `2.2`, did not help.

[1] https://github.com/opea-project/GenAIExamples/tree/v0.7/ChatQnA/kubernetes/manifests
Use-case
Constantly stress the ChatQnA `chaqna-xeon-backend-server-svc` service endpoint by sending it a large[2] number of queries in parallel.

[2] Compared to the actual capacity of the service, e.g. 8 queries in parallel for a service running on an Ice Lake Xeon.
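For illustration, a minimal sketch of such a stress client, assuming the backend exposes the ChatQnA `POST /v1/chatqna` route on port 8888 and streams its reply; the service address and prompt are placeholders:

```python
import concurrent.futures
import requests

URL = "http://chaqna-xeon-backend-server-svc:8888/v1/chatqna"  # placeholder address

def query(i: int) -> str:
    try:
        # Stream the reply so a truncated body surfaces as an error
        # instead of being silently cut short.
        with requests.post(URL, json={"messages": "What is deep learning?"},
                           stream=True, timeout=300) as resp:
            resp.raise_for_status()
            chunks = sum(1 for _ in resp.iter_content(chunk_size=None))
            return f"query {i}: OK ({chunks} chunks)"
    except requests.exceptions.ChunkedEncodingError as e:
        return f"query {i}: truncated reply: {e}"  # the 'unexpected EOF' case
    except requests.exceptions.RequestException as e:
        return f"query {i}: {e}"

# 8 parallel queries, repeated indefinitely to keep the service stressed.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    while True:
        for result in pool.map(query, range(8)):
            print(result)
```

With a streaming read like this, a reply that ends before the advertised length shows up as the truncated-reply case, whether it happens before the first token or mid-stream.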
Actual outcome
Occasionally an `unexpected EOF` error (i.e. the service reply ended before the specified `Content-Length`).

For the exception details, see the attachments:

- chaqna-xeon-backend-server-svc
- reranking-svc
- llm-svc

Logs for the pods of the `embedding-svc`, `redis-vector-db`, `retriever-svc`, `tei-embedding-svc`, `tei-reranking-svc` and `tgi-svc` services did not show any exceptions or other errors.

The `unexpected EOF` error can happen before `chaqna-xeon-backend-server-svc` replies with the first token, or after it has already provided 100-200 tokens of the reply.

Expected outcome
- Services should handle common exceptions gracefully: briefly log the error and tell the caller that now is not a good time for queries (return e.g. `503 Service Unavailable`), instead of spamming the log with exceptions and "crashing" the reply connection.
- Have some kind of rate-limiting for `chaqna-xeon-backend-server-svc`, so that if it gets too many requests before earlier ones have been processed promptly enough, it starts replying `503` pre-emptively[2], instead of making the situation worse by trying to process all requests although it currently has no capacity for them, and then failing in the middle (see the sketch after this list).

Note: rate-limiting helps when service scale-up is slow (a TGI pod may take minutes from startup until it is ready to respond), and once the service has been scaled up as far as it can go. But it needs to be done so that it does not prevent scale-up, or cause too much fluctuation in it.
(No comment on whether that should be implemented in `chaqna-xeon-backend-server-svc` itself, or in some load-balancer in front of it.)
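To make the rate-limiting idea concrete, here is a minimal sketch of pre-emptive load-shedding at the application level, assuming a FastAPI-style service like the OPEA microservices; the concurrency limit and the `Retry-After` value are illustrative assumptions, not measured numbers:

```python
import asyncio
from fastapi import FastAPI, Request, Response

app = FastAPI()

# Illustrative limit: reject new work beyond the measured capacity
# (e.g. ~8 parallel queries on the Ice Lake Xeon mentioned above).
MAX_IN_FLIGHT = 8
in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.middleware("http")
async def shed_load(request: Request, call_next):
    if in_flight.locked():
        # Already at capacity: fail fast with 503 instead of accepting
        # the request and then breaking the connection mid-reply.
        return Response(status_code=503,
                        headers={"Retry-After": "5"},
                        content="Service overloaded, try again later")
    async with in_flight:
        return await call_next(request)
```

The same shedding could equally live in a load-balancer or ingress in front of the service; the key point is rejecting excess requests up front with `503` rather than accepting them and then failing in the middle of the reply.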