langchain-ai / langchain


LiteLLM - cache_hit flag reported as None when using the async stream API #24120

Open m-revetria opened 1 month ago

m-revetria commented 1 month ago

Example Code

import asyncio
import litellm

from langchain_community.chat_models.litellm_router import ChatLiteLLMRouter
from langchain_core.messages import HumanMessage
from langchain_core.prompt_values import ChatPromptValue
from litellm import ModelResponse, Router
from litellm.integrations.custom_logger import CustomLogger

def get_llm_router() -> Router:
    """
    Return a new instance of Router, ensure to pass the following parameters so responses are cached:
        * redis_host
        * redis_port
        * redis_password
        * cache_kwargs
        * cache_responses
        * caching_groups
    """
    raise NotImplementedError('Create your own router')

class MyLogger(CustomLogger):
    """Custom LiteLLM logger that prints the response id and the cache_hit flag for every successful call."""

    async def async_log_success_event(self, kwargs, response_obj: ModelResponse, start_time, end_time):
        print(f"[MyLogger::async_log_success_event] response id: '{response_obj.id}'; cache_hit: '{kwargs.get('cache_hit', '')}'.\n\n")

# Register the custom logger so LiteLLM invokes it after every successful completion, sync or async.
my_logger = MyLogger()
litellm.callbacks = [my_logger]

async def chat():
    llm = ChatLiteLLMRouter(router=get_llm_router())

    msg1 = ""
    msg1_count = 0
    async for msg in llm.astream(
            input=ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])):
        msg1 += msg.content
        if msg.content:
            msg1_count += 1

    print(f"msg1 (count={msg1_count}): {msg1}\n\n")

    msg2 = ""
    msg2_count = 0
    async for msg in llm.astream(input=ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])):
        msg2 += msg.content
        if msg.content:
            msg2_count += 1

    print(f"msg2 (count={msg2_count}): {msg2}\n\n")

    # Give the async logger callbacks time to run before the event loop closes.
    await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(chat())
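
For context, here is a minimal sketch of what get_llm_router might look like with response caching enabled. The model list, API key, Redis credentials, and routing strategy below are placeholder values for illustration, not part of the original report:

def get_llm_router() -> Router:
    """Hypothetical router factory: caches responses in Redis so repeated prompts can be served from the cache."""
    return Router(
        model_list=[
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "sk-..."},
            }
        ],
        routing_strategy="latency-based-routing",
        redis_host="localhost",
        redis_port=6379,
        redis_password="my-redis-password",
        cache_responses=True,
        cache_kwargs={},
        caching_groups=None,
    )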

Error Message and Stack Trace (if applicable)

This is the output generated running the shared code:

Intialized router with Routing strategy: latency-based-routing

Routing fallbacks: None

Routing context window fallbacks: None

Router Redis Caching=<litellm.caching.RedisCache object at 0x12370da10>
msg1 (count=20): The first planet in the solar system, starting from the one closest to the Sun, is Mercury.

[MyLogger::async_log_success_event] response id: 'chatcmpl-9jnacYSdnczh2zWMKi3l813lNXVtE'; cache_hit: 'None'.

msg2 (count=1): The first planet in the solar system, starting from the one closest to the Sun, is Mercury.

[MyLogger::async_log_success_event] response id: 'chatcmpl-9jnacYSdnczh2zWMKi3l813lNXVtE'; cache_hit: 'None'.

Notice the two lines starting with [MyLogger::async_log_success_event], both saying cache_hit: 'None'. The second one is expected to report True, since that call to astream produced a single chunk containing the entire (cached) message.

Description

I'm trying to cache LLM responses using the LiteLLM router cache settings and to get notified when a response is served from the cache instead of the LLM. For that purpose I've implemented a custom logger as shown in the LiteLLM docs.

The issue is that when I call the astream API, as shown in the code snippet above, the cache_hit flag is None even in the case where the response is returned from the cache.

When I call the ainvoke API (await llm.ainvoke(...)), the cache_hit flag is passed to my custom logger as True after the second call, as expected.
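
For comparison, a minimal sketch of the ainvoke path (reusing get_llm_router and MyLogger from the example above), where the second call does get cache_hit reported as True:

async def chat_invoke():
    # Same router and logger setup as in the streaming example.
    llm = ChatLiteLLMRouter(router=get_llm_router())
    prompt = ChatPromptValue(messages=[HumanMessage("What's the first planet in solar system?")])

    # First call goes to the LLM; the logger reports cache_hit: 'None'.
    print(await llm.ainvoke(input=prompt))

    # Second call with the same prompt is served from the cache;
    # the logger reports cache_hit: 'True'.
    print(await llm.ainvoke(input=prompt))

    # Give the async logger callbacks time to run before the event loop closes.
    await asyncio.sleep(5)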

System Info

$ poetry run python -m langchain_core.sys_info

System Information
------------------
> OS:  Darwin
> OS Version:  Darwin Kernel Version 23.2.0: Wed Nov 15 21:54:10 PST 2023; root:xnu-10002.61.3~2/RELEASE_X86_64
> Python Version:  3.11.7 (main, Dec 15 2023, 12:09:04) [Clang 14.0.6 ]

Package Information
-------------------
> langchain_core: 0.2.13
> langchain: 0.2.7
> langchain_community: 0.2.7
> langsmith: 0.1.85
> langchain_openai: 0.1.15
> langchain_text_splitters: 0.2.0

Packages not installed (Not Necessarily a Problem)
--------------------------------------------------
The following packages were not found:

> langgraph
> langserve

eyurtsev commented 1 month ago

@m-revetria caching for llms and chat models is not supported on the stream path at the moment

m-revetria commented 1 month ago

Hi @eyurtsev, I thought caching was working for the streaming API because the first call to astream returns the answer from the LLM in multiple chunks (token by token), while the second call returns the same answer in a single chunk. Does this mean the answer was cached?

Is caching for streaming on the roadmap? If so, could you share a possible ETA for this feature, please?

Thanks!