git-hamza opened this issue 1 month ago · Status: Open
Hey there, @git-hamza! I'm here to help with any bugs, questions, or contributions you have in mind. Let's tackle this together!
I found a similar unsolved discussion that might be relevant to your issue: Assistance Needed: Reason for significantly lower context length limit with LlamaIndex vs appending raw text? [1].
To fix the issue where `additional_kwargs` in `ChatMessage` causes the token limit to be exceeded in `CondensePlusContextChatEngine` when using `ChatMemoryBuffer`, you need to ensure that the token count calculation includes the `additional_kwargs` content. You can modify the `_token_count_for_messages` method to account for the `additional_kwargs` of each `ChatMessage`.
Here's the updated `_token_count_for_messages` method:
```python
def _token_count_for_messages(self, messages: List[ChatMessage]) -> int:
    if len(messages) <= 0:
        return 0

    # Count both the visible content and everything stored in
    # additional_kwargs, so trimming budgets for the full payload.
    msg_str = " ".join(
        str(m.content)
        + " "
        + " ".join(str(v) for v in m.additional_kwargs.values())
        for m in messages
    )
    return len(self.tokenizer_fn(msg_str))
```
This modification ensures that the content of `additional_kwargs` is included in the token count calculation, preventing the token limit from being exceeded [2].
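If you would rather not patch the library in place, the same change can live in a small subclass. This is a rough sketch along those lines; the subclass name and the sample message are made up for illustration, and it assumes the `ChatMemoryBuffer` / `ChatMessage` APIs referenced above:

```python
from typing import List

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer


class KwargsAwareMemoryBuffer(ChatMemoryBuffer):
    """Memory buffer that counts additional_kwargs toward the token limit."""

    def _token_count_for_messages(self, messages: List[ChatMessage]) -> int:
        if len(messages) <= 0:
            return 0
        msg_str = " ".join(
            str(m.content)
            + " "
            + " ".join(str(v) for v in m.additional_kwargs.values())
            for m in messages
        )
        return len(self.tokenizer_fn(msg_str))


# Example: the buffer now budgets for the full payload when trimming history.
memory = KwargsAwareMemoryBuffer.from_defaults(token_limit=3000)
memory.put(
    ChatMessage(
        role="user",
        content="What changed in the last release?",
        additional_kwargs={"source_ids": ["doc-1", "doc-7"]},
    )
)
print(memory._token_count_for_messages(memory.get_all()))
```

Passing an instance of such a buffer as `memory=` to `CondensePlusContextChatEngine.from_defaults` should make the trimming budget reflect the full message payload.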
To continue talking to Dosu, mention @dosu.
Bug Description
I am working with `CondensePlusContextChatEngine` and using `ChatMemoryBuffer`. `ChatMessage` provides `additional_kwargs`, and I am using it to store extra information about each message.
I have set the token limit for the chat memory buffer to 75 percent of my LLM's context window.
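For context, my setup looks roughly like the sketch below; the data path, context window size, and the `additional_kwargs` payload are placeholders rather than my real values:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.memory import ChatMemoryBuffer

# Placeholder index; the real data and retriever are different.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

CONTEXT_WINDOW = 16_384  # placeholder for the LLM's context window
memory = ChatMemoryBuffer.from_defaults(token_limit=int(CONTEXT_WINDOW * 0.75))

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(),
    memory=memory,
)

# Extra information is attached to each message via additional_kwargs.
memory.put(
    ChatMessage(
        role=MessageRole.USER,
        content="Summarize the onboarding document.",
        additional_kwargs={"page_refs": [3, 4], "trace_id": "abc-123"},
    )
)
response = chat_engine.chat("And what about the second section?")
```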
Now, while chatting with my chat engine, I eventually hit the model's maximum context length error. This is caused by what happens during the `_condense_question` step, as explained below. The message passes through this line successfully:
https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L247
This is because the `ChatMemoryBuffer` `get` function does not take `additional_kwargs` into account when calculating tokens with `_token_count_for_messages`:
https://github.com/run-llama/llama_index/blob/044a439dc2fda53c991e49a43e4e8e652dd8a735/llama-index-core/llama_index/core/memory/chat_memory_buffer.py#L124-L140
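A quick way to see the gap, using a made-up message with small visible content and a large `additional_kwargs` payload:

```python
from llama_index.core.llms import ChatMessage
from llama_index.core.utils import get_tokenizer

tokenizer = get_tokenizer()

# Made-up message: small visible content, large additional_kwargs payload.
msg = ChatMessage(
    role="user",
    content="What does section 2 say?",
    additional_kwargs={"raw_context": "lorem ipsum " * 2000},
)

content_tokens = len(tokenizer(str(msg.content)))
kwargs_tokens = len(
    tokenizer(" ".join(str(v) for v in msg.additional_kwargs.values()))
)

# ChatMemoryBuffer budgets only for content_tokens, so this message fits the
# token_limit check even though content_tokens + kwargs_tokens may not fit
# the LLM's context window.
print(content_tokens, kwargs_tokens)
```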
The problem occurs inside this function
https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L250
In `_condense_question`, we use `_messages_to_history_str` and pass its output directly to the LLM. As you can see, `_messages_to_history_str` returns all of the `additional_kwargs` as well, so they also become part of the prompt. As a result, the prompt exceeds the token limit and there are not enough tokens left to return the output:
https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L163-L177
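To illustrate, here is a simplified standalone mirror of what the linked `_messages_to_history_str` does, based on my reading of that code (the sample message is made up):

```python
from typing import List

from llama_index.core.llms import ChatMessage


def messages_to_history_str(messages: List[ChatMessage]) -> str:
    """Simplified standalone mirror of the linked _messages_to_history_str."""
    string_messages = []
    for message in messages:
        string_message = f"{message.role.value}: {message.content}"
        # Any additional_kwargs are appended verbatim, so whatever is stored
        # there ends up inside the condense prompt as well.
        if message.additional_kwargs:
            string_message += f"\n{message.additional_kwargs}"
        string_messages.append(string_message)
    return "\n".join(string_messages)


history = [
    ChatMessage(
        role="user",
        content="What does section 2 say?",
        additional_kwargs={"raw_context": "<large retrieved text>"},
    )
]
print(messages_to_history_str(history))
# user: What does section 2 say?
# {'raw_context': '<large retrieved text>'}
```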
Possible Fixes:
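As far as I can tell there are two directions: include `additional_kwargs` in `ChatMemoryBuffer._token_count_for_messages` (as in the snippet above), or keep `additional_kwargs` out of the condense prompt altogether. Below is a rough, untested sketch of the second option; the class name and approach are only an illustration, not an existing API:

```python
from typing import Any, List

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer


class StrippedKwargsMemoryBuffer(ChatMemoryBuffer):
    """Hand the chat engine copies of the history without additional_kwargs,
    so the extra payload never reaches the condense prompt. The payload is
    still stored and can be read back with get_all()."""

    def get(self, input: Any = None, **kwargs: Any) -> List[ChatMessage]:
        messages = super().get(input=input, **kwargs)
        return [
            ChatMessage(role=m.role, content=m.content, additional_kwargs={})
            for m in messages
        ]
```

The trade-off is that stripping `additional_kwargs` in `get()` also hides them from anything else that reads memory through `get()`, so counting them in the token budget may be the safer default.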
Version
0.11.10
Steps to Reproduce
As explained in the Description.
Relevant Logs/Tracebacks
No response