quarkiverse / quarkus-langchain4j

Quarkus Langchain4j extension
https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html
Apache License 2.0

Check if we could implement a Cache for LLMs #637

Open · iocanel opened 4 weeks ago

iocanel commented 4 weeks ago

I've discussed that a little bit with @geoand and @cescoffier and it seems that there might be value in providing the ability to have a cache for LLMs.

It could work by creating an embedding for the prompt and specifying a threshold for the distance. If the distance is shorter than the specified threshold, we could serve the cached response.

Extra points for integrating it with @CacheResult.
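A minimal sketch of the idea, kept deliberately abstract: the class and method names below (SemanticPromptCache, the embedder function) are hypothetical, and cosine similarity is used in place of a raw distance, so the check becomes "similarity above a threshold" rather than "distance below a threshold".

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

// Sketch: embed the prompt, compare it against cached entries, and return the
// stored answer when the similarity clears a configurable threshold.
// The embedder is kept abstract (any function producing a float[] vector),
// e.g. an in-process/local embedding model.
class SemanticPromptCache {

    record Entry(float[] vector, String response) {}

    private final List<Entry> entries = new CopyOnWriteArrayList<>();
    private final Function<String, float[]> embedder;
    private final double threshold; // e.g. 0.95 cosine similarity

    SemanticPromptCache(Function<String, float[]> embedder, double threshold) {
        this.embedder = embedder;
        this.threshold = threshold;
    }

    Optional<String> lookup(String prompt) {
        float[] query = embedder.apply(prompt);
        return entries.stream()
                .filter(e -> cosine(e.vector(), query) >= threshold)
                .findFirst()
                .map(Entry::response);
    }

    void put(String prompt, String response) {
        entries.add(new Entry(embedder.apply(prompt), response));
    }

    // Assumes both vectors have the same dimension.
    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```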

geoand commented 4 weeks ago

The only issue I see is that we might cache a response that makes no sense (which can obviously happen due to the non-deterministic nature of LLMs).

andreadimaio commented 4 weeks ago

What do you mean by creating an embedding for the prompt? Is your idea to embed the user's queries and, if two queries are similar based on a threshold, return the cached response from the LLM?

cescoffier commented 3 weeks ago

@andreadimaio Yes, exactly.

For each user request, we compute an embedding that we store (with an eviction strategy like a TTL). It should only be used for a "stateless" query (so the context contains only the system message (if any) and the user message, no other messages - as it could leak data).

When a subsequent query occurs and its embedding is close to a stored one (the threshold is configurable, of course), we would use the cached response.

The rationale is that computing an embedding, especially with in-process/local embedding, is cheaper and faster than calling the LLM.
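A rough sketch of the eviction side of this, assuming a simple timestamp-per-entry approach; TtlEvictingStore and TimedEntry are illustrative names, not an existing API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: every cached embedding carries a timestamp and is dropped once it is
// older than a configurable TTL, so stale answers can never be served.
class TtlEvictingStore {

    record TimedEntry(float[] vector, String response, Instant storedAt) {}

    private final ConcurrentLinkedQueue<TimedEntry> entries = new ConcurrentLinkedQueue<>();
    private final Duration ttl;

    TtlEvictingStore(Duration ttl) {
        this.ttl = ttl;
    }

    void add(float[] vector, String response) {
        entries.add(new TimedEntry(vector, response, Instant.now()));
    }

    // Intended to be called before every lookup.
    void evictExpired() {
        Instant cutoff = Instant.now().minus(ttl);
        Iterator<TimedEntry> it = entries.iterator();
        while (it.hasNext()) {
            if (it.next().storedAt().isBefore(cutoff)) {
                it.remove();
            }
        }
    }
}
```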

andreadimaio commented 3 weeks ago

Thanks @cescoffier, so if I understand correctly, the cache will only contain the last message for each user? In my mind the cache could contain more messages, but now that I think about it, multiple messages could generate strange behaviors.

geoand commented 3 weeks ago

I am also wondering how we are going to handle the cache when the user wants to try different prompts in order to test how best to instruct the model to behave.

andreadimaio commented 3 weeks ago

> I am also wondering how we are going to handle the cache when the user wants to try different prompts in order to test how best to instruct the model to behave.

Could the @CacheResult annotation help with this? What I mean is: if the annotation is not present, caching will not be enabled for a particular method or class.
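A hypothetical illustration of that opt-in behavior, borrowing the existing io.quarkus.cache.CacheResult annotation on a registered AI service purely to show the shape; whether the extension would reuse this annotation or introduce a dedicated one is exactly what is being discussed here.

```java
import io.quarkiverse.langchain4j.RegisterAiService;
import io.quarkus.cache.CacheResult;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

// Only the annotated method would go through the (semantic) cache;
// the un-annotated one would always call the LLM.
@RegisterAiService
interface PoemService {

    @SystemMessage("You are a poet.")
    @CacheResult(cacheName = "llm-responses")
    @UserMessage("Write a poem of 5 lines about {topic}")
    String cachedPoem(String topic);

    @SystemMessage("You are a poet.")
    @UserMessage("Write a poem of 5 lines about {topic}")
    String uncachedPoem(String topic);
}
```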

geoand commented 3 weeks ago

Right, I guess my point is more that if we do add this feature, we would need to clearly document that it should be used only when the user has settled on the proper prompt after the exploratory phase of figuring out what that prompt is.

andreadimaio commented 3 weeks ago

Let me go back for a moment to what could happen with a cache of multiple messages. This behavior could occur when using the LLM as a chatbot.

Threshold = 1 # The queries must be equal.

UserMessage:
Can you create a poem about dog of 5 lines? 

AiMessage:
In morning light, the faithful dog does prance,
With wagging tail and eyes that always dance.
A friend so true, in joy and sorrow near,
A heart so pure, no trace of doubt or fear.
In every bark, a symphony of chance.

UserMessage: Json version please.
AiMessage: {
  "poem": [
    "In morning light, the faithful dog does prance,",
    "With wagging tail and eyes that always dance.",
    "A friend so true, in joy and sorrow near,",
    "A heart so pure, no trace of doubt or fear.",
    "In every bark, a symphony of chance."
  ]
}

UserMessage: Can you create a poem about cat of 5 lines? 
AiMessage:
In moonlit night, the graceful cat does roam,
With silent steps, it claims the world its own.
A gaze so sharp, a mystery it weaves,
A whisper soft, it dances through the leaves.
In every purr, a symphony of home.

UserMessage: Json version please.
AiMessage: <Wrong cached response>

But this is something anyone using this feature should keep in mind.

cescoffier commented 3 weeks ago

@andreadimaio This example is "stateful". You are having a conversation. Caching in this context is difficult and potentially dangerous (you may end up reusing the response generated for a different user and then leaking data). That's why I don't believe, at least in the first iteration, that it should be possible to cache in this case.
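A sketch of that "stateless only" guard, using LangChain4j message types; the StatelessGuard class itself is hypothetical.

```java
import java.util.List;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;

// The cache would be consulted only when the context holds nothing but an
// optional system message plus the current user message, so answers produced
// inside an ongoing conversation are never reused.
final class StatelessGuard {

    static boolean isCacheable(List<ChatMessage> context) {
        if (context.size() == 1) {
            return context.get(0) instanceof UserMessage;
        }
        if (context.size() == 2) {
            return context.get(0) instanceof SystemMessage
                    && context.get(1) instanceof UserMessage;
        }
        return false;
    }
}
```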

geoand commented 3 weeks ago

@langchain4j proposed looking at https://techcommunity.microsoft.com/t5/azure-architecture-blog/optimize-azure-openai-applications-with-semantic-caching/ba-p/4106867

langchain4j commented 3 weeks ago

Not exactly that article, but the "semantic cache" concept

geoand commented 3 weeks ago

Right, it's just the first one I found from a known source :)

andreadimaio commented 3 weeks ago

I'm trying to implement this functionality and I have some code that I'm testing locally. I might share a draft PR after I make some more changes.