run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Use LLM caching with llama index. #10645

Closed rudrakhsha-c closed 2 months ago

rudrakhsha-c commented 7 months ago

Question Validation

Question

I want to use semantic caching with my llama index application, which fetches information from my documents whenever a user query hits the query/chat engine. I am using llama2 as the LLM and sentence-transformers as the embedding model. I have explored the LangChain caching mechanisms such as InMemoryCache(), which just maintains a dict in the backend keyed on the exact query string, and there is also the option of RedisSemanticCache. I want to use that with my custom LLM, llama2, which is loaded into my local environment. Here is that part of the code for reference:

# imports assume the llama-index 0.10+ package layout; older versions expose these from the top-level llama_index package
from llama_index.core import SimpleDirectoryReader, PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

system_prompt = """[INST] <<SYS>>
You are an IT Q&A assistant. Your goal is to answer questions as accurately and completely as possible based on the instructions
and the context provided. You should ensure to give full answers and not miss any information or steps which are present in the documents.
<</SYS>>
"""

query_wrapper_prompt = PromptTemplate("{query_str} [/INST]")

# loading our own llm from huggingface
# using the llama-2-7b-chat-hf version here.

documents = SimpleDirectoryReader("./Data/").load_data()

import torch

llm = HuggingFaceLLM(
    # how much previous context is considered when generating new tokens
    context_window=1024,
    # controls the length of the generated output
    max_new_tokens=256,
    # temperature controls randomness; do_sample=True enables sampling
    # (do_sample=False would instead greedily pick the highest-probability tokens)
    generate_kwargs={"temperature": 0.5, "do_sample": True},
    # giving the context to the llm (we have defined a system prompt above)
    system_prompt=system_prompt,
    # query format to the model
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    # decides whether to run on cpu or gpu (here: cpu)
    device_map="cpu",
    # max tokens which can be fed into the model from user input
    tokenizer_kwargs={"max_length": 1024},
    # the dtype used when loading model parameters into pytorch tensors; 16-bit floats
    # roughly halve memory usage compared to 32-bit (uncomment the float16 line when using CUDA)
    # model_kwargs={"torch_dtype": torch.float16}
    model_kwargs={"torch_dtype": torch.float32}
)

My caching code:

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

set_llm_cache(RedisSemanticCache(redis_url="redis://localhost:6379", embedding=hf))

I am unable to use the llama2 llm object together with the RedisSemanticCache; the query engine hangs and never finishes executing. I also found that we can use custom LLMs from huggingface_hub like below:

# import path may differ depending on your langchain version
from langchain_community.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id=repo_id, model_kwargs={"temperature": 0.5, "max_length": 64}
)

But this would make API calls to models hosted on Hugging Face, which is not ideal for my use case. It also requires a Hugging Face Pro subscription to invoke models larger than 10 GB.

Is there ANY other way of caching responses with llama-index where we can use a custom LLM as well as a custom embedding model? If not, can someone help me use a local llama2 model together with the RedisSemanticCache provided by LangChain? Thanks

dosubot[bot] commented 7 months ago

🤖

Hey @rudrakhsha-c! Great to see you back with another intriguing challenge. Ready to dive back into the world of LlamaIndex? Let's unravel this together. 🌀

Based on the context provided, it seems like you're trying to use a custom LLM (Llama-2) with the RedisSemanticCache provided by Langchain. However, you're facing an issue where the query engine goes into infinite execution.

LlamaIndex is built to work with custom LLMs and embedding models throughout the codebase. One example is the SemanticSplitterNodeParser class in the provided context. This class splits a document into nodes, where each node is a group of semantically related sentences, and it uses an embedding model to evaluate the semantic similarity between sentences. The embed_model field on the class is of type BaseEmbedding, so it accepts any embedding model that extends the BaseEmbedding class, which allows custom embedding models to be plugged in.

The buffer_size field determines the number of sentences to group together when evaluating semantic similarity. The breakpoint_percentile_threshold field sets the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a node.

Here is the relevant code:

class SemanticSplitterNodeParser(NodeParser):
    embed_model: BaseEmbedding = Field(
        description="The embedding model to use for semantic comparison",
    )

    buffer_size: int = Field(
        default=1,
        description=(
            "The number of sentences to group together when evaluating semantic similarity. "
            "Set to 1 to consider each sentence individually. "
            "Set to >1 to group sentences together."
        ),
    )

    breakpoint_percentile_threshold: int = Field(
        default=95,
        description=(
            "The percentile of cosine dissimilarity that must be exceeded between a "
            "group of sentences and the next to form a node.  The smaller this "
            "number is, the more nodes will be generated"
        ),
    )

The from_defaults method allows for the creation of a SemanticSplitterNodeParser instance with custom parameters, including a custom embedding model:

@classmethod
def from_defaults(
    cls,
    embed_model: Optional[BaseEmbedding] = None,
    ...
) -> "SemanticSplitterNodeParser":
    embed_model = embed_model or OpenAIEmbedding()
    ...
    return cls(
        embed_model=embed_model,
        ...
    )
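
For example, a minimal sketch of wiring a local Hugging Face embedding model into this splitter could look like the following; it assumes the llama-index 0.10+ package layout, and the sentence-transformers model name is only an illustration:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# local embedding model used for the semantic comparison between sentences
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2")

splitter = SemanticSplitterNodeParser.from_defaults(
    embed_model=embed_model,
    buffer_size=1,
    breakpoint_percentile_threshold=95,
)

documents = SimpleDirectoryReader("./Data/").load_data()
nodes = splitter.get_nodes_from_documents(documents)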

This shows that LlamaIndex components are designed to accept custom embedding models (and, likewise, custom LLMs) rather than being tied to the OpenAI defaults.

However, the issue you're facing might be due to some other factors. It could be due to the way the LLM and the RedisSemanticCache are interacting with each other. It's also possible that there might be some issue with the Redis server or the way the RedisSemanticCache is set up.

To debug this issue, you could try the following steps:

  1. Check if the Redis server is running properly and is accessible from your application (a minimal connectivity check is sketched after this list).
  2. Try using the LLM and the RedisSemanticCache separately to see if they work as expected.
  3. If they work separately, try to gradually integrate them and see at which point the issue arises.
  4. Check the logs for any error messages or warnings that might give a clue about the issue.
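
For step 1, a quick connectivity check might look like the following; this is just a minimal sketch that assumes the redis Python client is installed and that Redis is listening on the default local port:

import redis

r = redis.Redis.from_url("redis://localhost:6379")
try:
    # ping() returns True if the server answers
    print("Redis reachable:", r.ping())
except redis.exceptions.ConnectionError as exc:
    print("Could not reach Redis:", exc)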

If you're still facing the issue after trying these steps, please provide more details about the issue, such as any error messages or logs, and the exact steps to reproduce the issue. This will help in diagnosing the problem and finding a solution.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

rudrakhsha-c commented 7 months ago


@dosu-bot then how can one carry out caching with llama index? Is it not possible at all for now? My main objective is to reduce inference time on my chat engine using techniques like caching. Can you help me with this?

dosubot[bot] commented 7 months ago

🤖

@rudrakhsha-c, I understand your main objective is to reduce inference time on your chat engine using techniques like caching. While LlamaIndex does not currently have built-in support for caching in the context of a chat engine, it is possible to implement caching at the application level.

Python's built-in functools.lru_cache decorator can be used to cache the results of expensive function calls. This can be particularly useful in the context of a chat engine where certain queries or operations may be repeated frequently.

In the context of LlamaIndex, the _generate_context and _agenerate_context methods in the ContextChatEngine class could be good candidates for caching as they are responsible for generating context information from a message. These methods involve retrieval and postprocessing operations which could be computationally expensive.

Here's how you can implement caching for these methods:

from functools import lru_cache

class ContextChatEngine(BaseChatEngine):
    # ...

    @lru_cache(maxsize=128)
    def _generate_context(self, message: str) -> Tuple[str, List[NodeWithScore]]:
        """Generate context information from a message."""
        # ...

    # note: lru_cache does not work correctly on async methods; it would cache the
    # coroutine object rather than its awaited result, so an async-aware cache
    # (or caching inside the method body) would be needed here
    @lru_cache(maxsize=128)
    async def _agenerate_context(self, message: str) -> Tuple[str, List[NodeWithScore]]:
        """Generate context information from a message."""
        # ...

The lru_cache decorator will keep a cache of the most recent function calls up to maxsize number of different argument patterns. If a subsequent call with the same arguments is made, the result is returned from the cache instead of executing the function again.

Please note that this is a simple form of caching and may not be suitable for all use cases. For example, it does not account for cases where the underlying data may change, invalidating the cached result. In such cases, a more sophisticated caching strategy may be required.
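
For illustration, a minimal hand-rolled cache with explicit invalidation could look like the sketch below; query_engine is assumed to be whatever engine you have already built, and CachedQueryEngine is a hypothetical wrapper, not a LlamaIndex API:

class CachedQueryEngine:
    """Hypothetical wrapper that memoizes responses keyed on the exact query string."""

    def __init__(self, query_engine):
        self._engine = query_engine
        self._cache = {}

    def query(self, query_str: str):
        # return the cached response if this exact query has been seen before
        if query_str not in self._cache:
            self._cache[query_str] = self._engine.query(query_str)
        return self._cache[query_str]

    def invalidate(self):
        # call this whenever the underlying documents or index change
        self._cache.clear()

Unlike lru_cache, this gives you an explicit invalidate() hook, at the cost of unbounded growth unless you add your own eviction policy.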

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Durga-iB-Analytics commented 5 months ago

Any update on this?