Closed rudrakhsha-c closed 2 months ago
🤖
Hey @rudrakhsha-c! Great to see you back with another intriguing challenge. Ready to dive back into the world of LlamaIndex? Let's unravel this together.
Based on the context provided, it seems like you're trying to use a custom LLM (Llama-2) with the RedisSemanticCache provided by LangChain. However, the query engine hangs and never finishes executing.
The LlamaIndex codebase does support custom embedding models for semantic operations. This is evident from the SemanticSplitterNodeParser class in the provided context, although that class performs semantic splitting rather than response caching. It splits a document into nodes, with each node being a group of semantically related sentences, and uses an embedding model to evaluate semantic similarity between sentences. The embed_model field in the class is of type BaseEmbedding, which means it can accept any embedding model that extends the BaseEmbedding class. This allows for the use of custom embedding models.
The buffer_size field determines the number of sentences to group together when evaluating semantic similarity. The breakpoint_percentile_threshold field sets the percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a node.
Here is the relevant code:
    class SemanticSplitterNodeParser(NodeParser):
        embed_model: BaseEmbedding = Field(
            description="The embedding model to use for semantic comparison",
        )
        buffer_size: int = Field(
            default=1,
            description=(
                "The number of sentences to group together when evaluating semantic similarity. "
                "Set to 1 to consider each sentence individually. "
                "Set to >1 to group sentences together."
            ),
        )
        breakpoint_percentile_threshold: int = Field(
            default=95,
            description=(
                "The percentile of cosine dissimilarity that must be exceeded between a "
                "group of sentences and the next to form a node. The smaller this "
                "number is, the more nodes will be generated"
            ),
        )
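To make the role of the percentile threshold concrete, here is a minimal sketch of how such a threshold can turn sentence-to-sentence distances into split points. It is an illustration only, not the library's implementation: the function names percentile and find_breakpoints and the sample distances are made up for this example.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def find_breakpoints(distances: List[float], threshold_pct: float = 95) -> List[int]:
    """Return every index i whose distance exceeds the percentile cutoff.

    distances[i] is the cosine dissimilarity between sentence group i and
    group i + 1, so a breakpoint at i means a new node starts after group i.
    """
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Hypothetical distances between six consecutive sentence groups.
distances = [0.10, 0.12, 0.80, 0.11, 0.09]
print(find_breakpoints(distances, threshold_pct=95))  # [2]
print(find_breakpoints(distances, threshold_pct=50))  # [1, 2]
```

Lowering the threshold from 95 to 50 adds a second split point, which matches the field description: the smaller the number, the more nodes are generated.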
The from_defaults method allows for the creation of a SemanticSplitterNodeParser instance with custom parameters, including a custom embedding model:
    @classmethod
    def from_defaults(
        cls,
        embed_model: Optional[BaseEmbedding] = None,
        ...
    ) -> "SemanticSplitterNodeParser":
        embed_model = embed_model or OpenAIEmbedding()
        ...
        return cls(
            embed_model=embed_model,
            ...
        )
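This shows that the LlamaIndex codebase does accept custom embedding models. Note, however, that SemanticSplitterNodeParser splits documents and does not cache LLM responses, so it is not a semantic cache in itself.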
However, the infinite execution you're seeing is probably caused by something else. Possible causes are the way the LLM and the RedisSemanticCache interact, a problem with the Redis server, or the way the RedisSemanticCache is set up.
To debug this issue, you could try the following steps:
- Check if the Redis server is running properly and is accessible from your application.
- Try using the LLM and the RedisSemanticCache separately to see if they work as expected.
- If they work separately, try to gradually integrate them and see at which point the issue arises.
- Check the logs for any error messages or warnings that might give a clue about the issue.
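The first step can be checked without any Redis client library. The sketch below is only a connectivity probe under stated assumptions: the function name redis_responds is made up for this example, and the default host and port (localhost, 6379) are Redis's usual defaults, which may not match your setup. It relies on the fact that Redis accepts inline commands terminated by CRLF.

```python
import socket


def redis_responds(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Send PING as an inline command and read the short reply.
            sock.sendall(b"PING\r\n")
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling redis_responds() before building the cache tells you whether a hang is a connectivity problem. A password-protected server answers with an error instead of +PONG, so the function returns False in that case as well.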
If you're still facing the issue after trying these steps, please provide more details about the issue, such as any error messages or logs, and the exact steps to reproduce the issue. This will help in diagnosing the problem and finding a solution.
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot then how can one carry out caching with LlamaIndex? Is it not possible at all for now? My main objective is to reduce inference time on my chat engine using techniques like caching. Can you help me with this?
🤖
@rudrakhsha-c, I understand your main objective is to reduce inference time on your chat engine using techniques like caching. While LlamaIndex does not currently have built-in support for caching in the context of a chat engine, it is possible to implement caching at the application level.
As I mentioned in my previous response, Python's built-in functools.lru_cache decorator can be used to cache the results of expensive function calls. This can be particularly useful in the context of a chat engine where certain queries or operations may be repeated frequently.
In the context of LlamaIndex, the _generate_context and _agenerate_context methods in the ContextChatEngine class could be good candidates for caching as they are responsible for generating context information from a message. These methods involve retrieval and postprocessing operations which could be computationally expensive.
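Here's how you can implement caching for the synchronous method. The async variant needs a different approach, because lru_cache would cache the coroutine object rather than its result:
    from functools import lru_cache
    from typing import List, Tuple

    class ContextChatEngine(BaseChatEngine):
        # ...

        # The cache key is (self, message), so the engine instance must be hashable,
        # and the cache keeps a reference to it for as long as the entry lives.
        @lru_cache(maxsize=128)
        def _generate_context(self, message: str) -> Tuple[str, List[NodeWithScore]]:
            """Generate context information from a message."""
            # ...

        # No lru_cache here: it would store the coroutine object, which can be
        # awaited only once, so a second call with the same message would fail.
        async def _agenerate_context(self, message: str) -> Tuple[str, List[NodeWithScore]]:
            """Generate context information from a message."""
            # ...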
The lru_cache decorator will keep a cache of the most recent function calls up to maxsize number of different argument patterns. If a subsequent call with the same arguments is made, the result is returned from the cache instead of executing the function again.
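The behaviour described above can be observed with a small, self-contained sketch. Everything in it is a stand-in: generate_context and agenerate_context are placeholder functions, not LlamaIndex methods, and async_cached is a hypothetical helper written for this example. The helper exists because lru_cache applied directly to an async def function stores the coroutine object, which can be awaited only once.

```python
import asyncio
from functools import lru_cache, wraps

# Counts how often the "expensive" bodies actually run.
call_counts = {"sync": 0, "async": 0}


@lru_cache(maxsize=128)
def generate_context(message: str) -> str:
    """Stand-in for an expensive synchronous retrieval step."""
    call_counts["sync"] += 1
    return f"context for: {message}"


def async_cached(func):
    """Cache awaited results per message (unbounded, no eviction)."""
    results = {}

    @wraps(func)
    async def wrapper(message: str):
        if message not in results:
            results[message] = await func(message)
        return results[message]

    return wrapper


@async_cached
async def agenerate_context(message: str) -> str:
    """Stand-in for an expensive asynchronous retrieval step."""
    call_counts["async"] += 1
    await asyncio.sleep(0)
    return f"context for: {message}"


async def main() -> None:
    generate_context("hello")
    generate_context("hello")  # served from the cache
    await agenerate_context("hello")
    await agenerate_context("hello")  # served from the cache
    print(call_counts)
    print(generate_context.cache_info())


if __name__ == "__main__":
    asyncio.run(main())
```

Each body runs once even though each function is called twice. Note that async_cached does not deduplicate concurrent calls for the same message that start before the first one finishes, and it never evicts entries, so treat it as a starting point only.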
Please note that this is a simple form of caching and may not be suitable for all use cases. For example, it does not account for cases where the underlying data may change, invalidating the cached result. In such cases, a more sophisticated caching strategy may be required.
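One possible direction for such a strategy, given the original goal of semantic caching, is an application-level cache that matches queries by similarity instead of exact text and can be cleared when the documents change. The sketch below is a toy under loud assumptions: SemanticCache, toy_embed, and cosine are names invented for this example, the 0.8 threshold is arbitrary, and toy_embed uses plain word counts as a placeholder for a real embedding model such as a sentence-transformers model.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder embedding: lower-cased word counts. Swap in a real model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored response when a new query is similar enough to an old one."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the best-matching cached response, or None below the threshold."""
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def clear(self) -> None:
        """Drop all entries, for example after the underlying documents change."""
        self.entries.clear()


cache = SemanticCache()
cache.put("What is the refund policy?", "cached answer")
print(cache.get("what is the refund policy"))    # similar wording: cache hit
print(cache.get("How do I reset my password?"))  # unrelated query: None
```

In an application you would call get before invoking the query or chat engine and put after it returns. With a dense embedding model the vectors would be lists of floats, so cosine would need a list-based version, and a linear scan over entries only suits small caches.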
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
Any update on this?
Question
I want to use semantic caching with my LlamaIndex application, which fetches information from my documents whenever a user query arrives through the query/chat engine. I am using llama2 as the LLM and sentence transformers as the embedding model. I have explored the LangChain caching mechanisms like InMemoryCache(), which just maintains a dict in the backend for exact queries. There is also the option of using RedisSemanticCache. I want to use it with my custom LLM, llama2, which is loaded and run in my local environment. Here is that part of the code for reference:
My caching code:
I am unable to use the llama2 object llm along with the Redis semantic cache. The query engine goes into infinite execution. I also found that we can use custom LLMs from huggingface_hub like below:
But this would make API calls to models hosted on Hugging Face, which is not ideal for my use case. It also requires a Hugging Face Pro subscription to invoke models larger than 10GB in size.
Is there ANY other way of caching responses with llama-index where we can use custom llm as well as embedding models. If not, can someone help me use a local llama2 llm model along with the RedisSemanticCache given by Langchain? Thanks