quarkiverse / quarkus-langchain4j

Quarkus Langchain4j extension
https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html
Apache License 2.0

Resilience and AI Service #748

Closed cescoffier closed 1 month ago

cescoffier commented 1 month ago

This issue discusses the resilience in AI Services and their impact on the memory/context.

Context:

I'm using Granite 7B Instruct, which has a relatively limited context size (2048 tokens). My prompt (user message) is relatively large. I was using @Retry on the AI Service method, because the model misbehaves sometimes and retrying improves reliability (response time is not a factor in my context).

My AI Service calls are part of an HTTP request processing, so they are part of the request scope associated with the HTTP request.
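For illustration, a minimal sketch of the setup described above (the interface, method, and prompt are hypothetical; @RegisterAiService comes from quarkus-langchain4j and @Retry from SmallRye Fault Tolerance / MicroProfile Fault Tolerance):

```java
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import org.eclipse.microprofile.faulttolerance.Retry;

// Hypothetical AI Service invoked during HTTP request processing;
// its chat memory is bound to the surrounding request scope.
@RegisterAiService
public interface TriageService {

    @Retry(maxRetries = 3) // each retried attempt re-appends the user message to memory
    @UserMessage("Classify the following report: {report}")
    String triage(String report);
}
```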

Problem:

When retrying, the context grows, including multiple times the user message, which eventually exceeds the context size.

Let's try to describe it:

HTTP request ->
       - AI Service call - context=[user message] -> failure
       - AI Service call (retry 1) - context=[user message, user message] -> failure
       - AI Service call (retry 2) - context=[user message, user message, user message] -> failure: context size exceeded (unrecoverable)
       - AI Service call (retry 3) - context=[user message, user message, user message, user message] -> failure: context size exceeded (useless, as it's not recoverable)
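The growth above can be simulated with a plain list standing in for the chat memory (a deliberate simplification; a real ChatMemory also holds system messages and AI responses):

```java
import java.util.ArrayList;
import java.util.List;

public class RetryMemoryGrowth {

    // Naive retry loop: the user message is appended to the shared memory on
    // every attempt, so the context grows even though every attempt fails.
    static List<String> callWithRetry(String userMessage, int maxRetries) {
        List<String> memory = new ArrayList<>(); // stand-in for request-scoped chat memory
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            memory.add(userMessage); // memory is mutated before the call completes
            // (simulate a model failure on every attempt)
        }
        return memory;
    }

    public static void main(String[] args) {
        // 1 initial attempt + 3 retries = 4 copies of the user message
        System.out.println(callWithRetry("user message", 3).size()); // prints 4
    }
}
```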

While I hit this with @Retry, the same problem can occur with @CircuitBreaker and other fault-tolerance annotations.

Some ideas:

geoand commented 1 month ago

We could handle retry in the chat client. The problem is that we would also need rate limiting and circuit breakers, which would duplicate a lot of complex code.

I am wondering if, behind the scenes, we can "move" the resilience declared on the AiService to the underlying client...

geoand commented 1 month ago

From the Zulip discussion, another interesting idea is https://github.com/smallrye/smallrye-fault-tolerance/issues/259

maxandersen commented 1 month ago

Shouldn't the state used to make a call avoid being mutated before the call has completed? Wouldn't that avoid the "growing"?

geoand commented 1 month ago

So you are essentially proposing that the chat memory only be added to when the call succeeds, right?

That could potentially work...

geoand commented 1 month ago

> That could potentially work...

It's actually a lot trickier than I thought, because implementing an AI service can involve multiple API calls, each of which may add to (and even remove from) the memory.
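One way the commit-on-success idea could be realized despite that complication is to stage every memory mutation made during an invocation and merge it into the real memory only when the whole invocation succeeds. A minimal sketch with a list-backed memory; all names are illustrative and this is not necessarily the approach the eventual fix took:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative commit-on-success wrapper: mutations made during one AI
// service invocation (possibly spanning several internal API calls) are
// buffered, and only merged into the durable memory on success.
public class StagedMemory {

    private final List<String> committed = new ArrayList<>();
    private final List<String> staged = new ArrayList<>();

    // View seen by the model call: committed history plus staged messages.
    public List<String> view() {
        List<String> v = new ArrayList<>(committed);
        v.addAll(staged);
        return v;
    }

    public void add(String message) {
        staged.add(message); // buffered, not yet part of the durable history
    }

    public void commit() { // call once the whole invocation has succeeded
        committed.addAll(staged);
        staged.clear();
    }

    public void rollback() { // call on failure, e.g. before a retry
        staged.clear();
    }

    public int committedSize() {
        return committed.size();
    }
}
```

A failed attempt would call rollback() before the retry, so every retry starts from the same committed history and the context no longer grows across attempts.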

geoand commented 1 month ago

https://github.com/quarkiverse/quarkus-langchain4j/pull/764 fixes this