quarkiverse / quarkus-langchain4j

Quarkus Langchain4j extension
https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html
Apache License 2.0

Resilience and AI Service #748

Closed cescoffier closed 1 month ago

cescoffier commented 1 month ago

This issue discusses the resilience in AI Services and their impact on the memory/context.

Context:

I'm using Granite 7B Instruct, which has a relatively limited context size (2048 tokens). My prompt (user message) is relatively large. I was using @Retry on the AI Service method, because the model misbehaves sometimes and retrying improves reliability (response time is not a factor in my context).

My AI Service calls are part of an HTTP request processing, so they are part of the request scope associated with the HTTP request.
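For illustration, a minimal sketch of the setup described above (the interface, method, and prompt are hypothetical; @RegisterAiService comes from quarkus-langchain4j and @Retry from SmallRye Fault Tolerance / MicroProfile Fault Tolerance):

```java
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import org.eclipse.microprofile.faulttolerance.Retry;

// Hypothetical AI Service invoked during HTTP request processing;
// its chat memory is bound to the surrounding request scope.
@RegisterAiService
public interface TriageService {

    @Retry(maxRetries = 3) // each retried attempt re-appends the user message to memory
    @UserMessage("Classify the following report: {report}")
    String triage(String report);
}
```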

Problem:

When retrying, the context grows, including multiple times the user message, which eventually exceeds the context size.

Let's try to describe it:

HTTP request ->
       - AI Service call - context=[user message] -> failure
       - AI Service call (retry 1) - context=[user message, user message] -> failure
       - AI Service call (retry 2) - context=[user message, user message, user message] -> failure: context size exceeded (unrecoverable)
       - AI Service call (retry 3) - context=[user message, user message, user message, user message] -> failure: context size exceeded (useless, as it's not recoverable)
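The growth above can be simulated with a plain list standing in for the chat memory (a deliberate simplification; a real ChatMemory also holds system messages and AI responses):

```java
import java.util.ArrayList;
import java.util.List;

public class RetryMemoryGrowth {

    // Naive retry loop: the user message is appended to the shared memory on
    // every attempt, so the context grows even though every attempt fails.
    static List<String> callWithRetry(String userMessage, int maxRetries) {
        List<String> memory = new ArrayList<>(); // stand-in for request-scoped chat memory
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            memory.add(userMessage); // memory is mutated before the call completes
            // (simulate a model failure on every attempt)
        }
        return memory;
    }

    public static void main(String[] args) {
        // 1 initial attempt + 3 retries = 4 copies of the user message
        System.out.println(callWithRetry("user message", 3).size()); // prints 4
    }
}
```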

While I hit this with @Retry, the same problem can occur with @CircuitBreaker and other fault-tolerance annotations.

Some ideas:

geoand commented 1 month ago

We could handle retry in the chat client. The problem is that we would also need rate limiting and circuit breakers, which would duplicate a lot of complex code.

I am wondering if, behind the scenes, we can "move" the resilience declared on the AiService to the underlying client...

geoand commented 1 month ago

From the Zulip discussion, another interesting idea is https://github.com/smallrye/smallrye-fault-tolerance/issues/259

maxandersen commented 1 month ago

Shouldn't the state used to make a call avoid being mutated before the call has completed? Wouldn't that avoid the "growing"?

geoand commented 1 month ago

So you are essentially proposing that the chat memory only be added to when the call succeeds, right?

That could potentially work...

geoand commented 1 month ago

> That could potentially work...

It's actually a lot trickier than I thought, because implementing an AI service can involve multiple API calls, each of which may add to (and even remove from) the memory.
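One way the commit-on-success idea could be realized despite that complication is to stage every memory mutation made during an invocation and merge it into the real memory only when the whole invocation succeeds. A minimal sketch with a list-backed memory; all names are illustrative and this is not necessarily the approach the eventual fix took:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative commit-on-success wrapper: mutations made during one AI
// service invocation (possibly spanning several internal API calls) are
// buffered, and only merged into the durable memory on success.
public class StagedMemory {

    private final List<String> committed = new ArrayList<>();
    private final List<String> staged = new ArrayList<>();

    // View seen by the model call: committed history plus staged messages.
    public List<String> view() {
        List<String> v = new ArrayList<>(committed);
        v.addAll(staged);
        return v;
    }

    public void add(String message) {
        staged.add(message); // buffered, not yet part of the durable history
    }

    public void commit() { // call once the whole invocation has succeeded
        committed.addAll(staged);
        staged.clear();
    }

    public void rollback() { // call on failure, e.g. before a retry
        staged.clear();
    }

    public int committedSize() {
        return committed.size();
    }
}
```

A failed attempt would call rollback() before the retry, so every retry starts from the same committed history and the context no longer grows across attempts.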

geoand commented 1 month ago

https://github.com/quarkiverse/quarkus-langchain4j/pull/764 fixes this