langchain-ai / langchain


LiteLLM - cache_hit flag reported as None when using the async stream API #24120

Open m-revetria opened 1 month ago

m-revetria commented 1 month ago

Example Code

import asyncio
import litellm

from langchain_community.chat_models.litellm_router import ChatLiteLLMRouter
from langchain_core.messages import HumanMessage
from langchain_core.prompt_values import ChatPromptValue
from litellm import ModelResponse, Router
from litellm.integrations.custom_logger import CustomLogger

def get_llm_router() -> Router:
    """
    Return a new instance of Router, ensure to pass the following parameters so responses are cached:
        * redis_host
        * redis_port
        * redis_password
        * cache_kwargs
        * cache_responses
        * caching_groups
    """
    raise NotImplementedError('Create your own router')

class MyLogger(CustomLogger):
    """Custom LiteLLM logger that prints the response id and the cache_hit flag for every successful call."""

    async def async_log_success_event(self, kwargs, response_obj: ModelResponse, start_time, end_time):
        print(f"[MyLogger::async_log_success_event] response id: '{response_obj.id}'; cache_hit: '{kwargs.get('cache_hit', '')}'.\n\n")

# Register the custom logger so LiteLLM invokes it after every successful completion, sync or async.
my_logger = MyLogger()
litellm.callbacks = [my_logger]

async def chat():
    llm = ChatLiteLLMRouter(router=get_llm_router())

    msg1 = ""
    msg1_count = 0
    async for msg in llm.astream(
            input=ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])):
        msg1 += msg.content
        if msg.content:
            msg1_count += 1

    print(f"msg1 (count={msg1_count}): {msg1}\n\n")

    msg2 = ""
    msg2_count = 0
    async for msg in llm.astream(input=ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])):
        msg2 += msg.content
        if msg.content:
            msg2_count += 1

    print(f"msg2 (count={msg2_count}): {msg2}\n\n")

    # Give the async logger callbacks time to run before the event loop closes.
    await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(chat())
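
For context, here is a minimal sketch of what get_llm_router might look like with response caching enabled. The model list, API key, Redis credentials, and routing strategy below are placeholder values for illustration, not part of the original report:

def get_llm_router() -> Router:
    """Hypothetical router factory: caches responses in Redis so repeated prompts can be served from the cache."""
    return Router(
        model_list=[
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "sk-..."},
            }
        ],
        routing_strategy="latency-based-routing",
        redis_host="localhost",
        redis_port=6379,
        redis_password="my-redis-password",
        cache_responses=True,
        cache_kwargs={},
        caching_groups=None,
    )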

Error Message and Stack Trace (if applicable)

This is the output generated running the shared code:

Intialized router with Routing strategy: latency-based-routing

Routing fallbacks: None

Routing context window fallbacks: None

Router Redis Caching=<litellm.caching.RedisCache object at 0x12370da10>
msg1 (count=20): The first planet in the solar system, starting from the one closest to the Sun, is Mercury.

[MyLogger::async_log_success_event] response id: 'chatcmpl-9jnacYSdnczh2zWMKi3l813lNXVtE'; cache_hit: 'None'.

msg2 (count=1): The first planet in the solar system, starting from the one closest to the Sun, is Mercury.

[MyLogger::async_log_success_event] response id: 'chatcmpl-9jnacYSdnczh2zWMKi3l813lNXVtE'; cache_hit: 'None'.

Notice the two lines starting with [MyLogger::async_log_success_event], both saying cache_hit: 'None'. The second one is expected to report True, since that call to astream produced a single chunk containing the entire (cached) message.

Description

I'm trying to cache LLM responses using the LiteLLM router cache settings and to get notified when a response is served from the cache instead of the LLM. For that purpose I've implemented a custom logger as shown in the LiteLLM docs.

The issue is that when I call the astream API, as shown in the code snippet above, the cache_hit flag is None even in the case where the response is returned from the cache.

When I call the ainvoke API (await llm.ainvoke(...)), the cache_hit flag is passed to my custom logger as True after the second call, as expected.
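
For comparison, a minimal sketch of the ainvoke path (reusing get_llm_router and MyLogger from the example above), where the second call does get cache_hit reported as True:

async def chat_invoke():
    # Same router and logger setup as in the streaming example.
    llm = ChatLiteLLMRouter(router=get_llm_router())
    prompt = ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])

    # First call goes to the LLM; the logger reports cache_hit: 'None'.
    print(await llm.ainvoke(input=prompt))

    # Second call with the same prompt is served from the cache;
    # the logger reports cache_hit: 'True'.
    print(await llm.ainvoke(input=prompt))

    # Give the async logger callbacks time to run before the event loop closes.
    await asyncio.sleep(5)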

System Info

$ poetry run python -m langchain_core.sys_info

System Information
------------------
> OS:  Darwin
> OS Version:  Darwin Kernel Version 23.2.0: Wed Nov 15 21:54:10 PST 2023; root:xnu-10002.61.3~2/RELEASE_X86_64
> Python Version:  3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]

Package Information
-------------------
> langchain_core: 0.2.13
> langchain: 0.2.7
> langchain_community: 0.2.7
> langsmith: 0.1.85
> langchain_openai: 0.1.15
> langchain_text_splitters: 0.2.0

Packages not installed (Not Necessarily a Problem)
--------------------------------------------------
The following packages were not found:

> langgraph
> langserve

eyurtsev commented 1 month ago

@m-revetria caching for llms and chat models is not supported on the stream path at the moment

m-revetria commented 1 month ago

Hi @eyurtsev, I thought caching was working for the streaming API because the first call to astream returns the answer from the LLM in multiple chunks (token by token), while the second call returns the same answer in a single chunk. Does this mean the answer was cached?

Is caching for streaming on the roadmap? If so, could you share a possible ETA for this feature, please?

Thanks!