run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: additional_kwargs in ChatMessage messes up token limit in CondensePlusContextChatEngine #16125

Open git-hamza opened 1 month ago

git-hamza commented 1 month ago

Bug Description

I am working with CondensePlusContextChatEngine and ChatMemoryBuffer. ChatMessage provides additional_kwargs, and I use it to store extra information with each message.

I have set the token limit for the chat memory buffer to 75 percent of my LLM's context window.
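
Roughly, my setup looks like the sketch below. The 4096-token context window, the document text, and the metadata keys are placeholders; the relevant parts are the 75 percent token limit and the additional_kwargs attached to each ChatMessage:

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.memory import ChatMemoryBuffer

# Memory budget: 75 percent of the LLM context window (4096 is a placeholder)
context_window = 4096
memory = ChatMemoryBuffer.from_defaults(token_limit=int(context_window * 0.75))

# Each message carries extra metadata in additional_kwargs
memory.put(
    ChatMessage(
        role=MessageRole.USER,
        content="What does the report say about Q3 revenue?",
        additional_kwargs={"sources": ["report.pdf"], "trace_id": "abc-123"},
    )
)

# Placeholder index/retriever; the default Settings.llm is used for condensing
index = VectorStoreIndex.from_documents([Document(text="Q3 revenue grew 12%.")])
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(),
    memory=memory,
)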

Now, while chatting with the chat engine, I eventually hit the model's maximum context length error. This is caused by what happens during the _condense_question step, as explained below.

The messages pass through this line successfully:

https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L247

This is because the ChatMemoryBuffer get function does not take additional_kwargs into account when calculating the token count with _token_count_for_messages:

https://github.com/run-llama/llama_index/blob/044a439dc2fda53c991e49a43e4e8e652dd8a735/llama-index-core/llama_index/core/memory/chat_memory_buffer.py#L124-L140
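
In other words, only the message content is counted there. Roughly (my simplified paraphrase of the linked method, not a verbatim copy):

# Simplified paraphrase: only m.content is tokenized, so whatever is
# stored in additional_kwargs never counts against the buffer's limit.
def _token_count_for_messages(self, messages):
    if len(messages) <= 0:
        return 0
    msg_str = " ".join(str(m.content) for m in messages)
    return len(self.tokenizer_fn(msg_str))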

The problem occurs inside this function

https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L250

In _condense_question we use _messages_to_history_str and pass the result directly to the LLM. As the linked code shows, _messages_to_history_str includes all of the additional_kwargs as well, so they also become part of the prompt. As a result the prompt exceeds the token limit and there are not enough tokens left to generate the output.

https://github.com/run-llama/llama_index/blob/a18b94699ac4e49b17f3f49879adf29dfc7c3ed3/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py#L163-L177
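
To make the mismatch concrete, here is a small self-contained sketch. history_str_with_kwargs is my own illustration of a history-string builder that appends additional_kwargs (not the library code itself), and the whitespace split is only a rough stand-in for a real tokenizer:

from llama_index.core.llms import ChatMessage, MessageRole

def history_str_with_kwargs(messages):
    # Illustrative helper: role + content, with additional_kwargs appended,
    # mimicking how the metadata ends up inside the condense prompt
    lines = []
    for m in messages:
        line = f"{m.role.value}: {m.content}"
        if m.additional_kwargs:
            line += f"\n{m.additional_kwargs}"
        lines.append(line)
    return "\n".join(lines)

msg = ChatMessage(
    role=MessageRole.USER,
    content="Short question",
    additional_kwargs={"retrieved_chunks": ["chunk text " * 500]},  # heavy metadata
)

content_only = str(msg.content)
full_history = history_str_with_kwargs([msg])

# The memory buffer budgets for the first number, but the condense prompt
# is built from the second, much larger string.
print(len(content_only.split()), len(full_history.split()))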

Possible Fixes:

Version

0.11.10

Steps to Reproduce

As explained in the Description.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hey there, @git-hamza! I'm here to help with any bugs, questions, or contributions you have in mind. Let's tackle this together!

I found a similar unsolved discussion that might be relevant to your issue: Assistance Needed: Reason for significantly lower context length limit with LlamaIndex vs appending raw text? [1].

To fix the issue where additional_kwargs in ChatMessage causes the token limit to be exceeded in CondensePlusContextChatEngine when using ChatMemoryBuffer, you need to ensure that the token count calculation includes the additional_kwargs content. You can modify the _token_count_for_messages method to account for additional_kwargs in each ChatMessage.

Here's the updated _token_count_for_messages method:

def _token_count_for_messages(self, messages: List[ChatMessage]) -> int:
    if len(messages) <= 0:
        return 0

    # Count additional_kwargs values alongside the message content so the
    # buffer's budget reflects everything that can later end up in the prompt.
    msg_str = " ".join(
        str(m.content) + " " + " ".join(str(v) for v in m.additional_kwargs.values())
        for m in messages
    )
    return len(self.tokenizer_fn(msg_str))

This modification ensures that the content of additional_kwargs is included in the token count calculation, preventing the token limit from being exceeded [2].
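
A quick way to sanity-check the change in isolation (standalone sketch; str.split stands in for the buffer's real tokenizer_fn):

from llama_index.core.llms import ChatMessage, MessageRole

tokenizer_fn = str.split  # rough stand-in for the buffer's tokenizer_fn

msg = ChatMessage(
    role=MessageRole.USER,
    content="Short question",
    additional_kwargs={"notes": "fairly long metadata " * 50},
)

# Old behaviour: only the content is counted
old_count = len(tokenizer_fn(str(msg.content)))

# Updated behaviour: additional_kwargs values are counted as well
new_count = len(
    tokenizer_fn(
        str(msg.content)
        + " "
        + " ".join(str(v) for v in msg.additional_kwargs.values())
    )
)

print(old_count, new_count)  # new_count is far larger for metadata-heavy messages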

To continue talking to Dosu, mention @dosu.