run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: OpenAIAgentWorker's `get_all_messages` function may go outside of context window #14660

Open · Laktus opened this issue 3 months ago

Laktus commented 3 months ago

Bug Description

Hello everyone,

In my example application I get the following error:

raise self._make_status_error_from_response(err.response) from None openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 17195 tokens (17141 in the messages, 54 in the functions). Please reduce the length of the messages or functions.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

I think this is due to OpenAIAgentWorker's get_all_messages being implemented incorrectly.

def get_all_messages(self, task: Task) -> List[ChatMessage]:
        return (
            self.prefix_messages
            + task.memory.get()
            + task.extra_state["new_memory"].get_all()
        )

Concatenating the three types of messages bypasses the pruning of the default ChatMemoryBuffer, which causes the query to the LLM to overflow the context window. Is there a workaround for this (a rough sketch of one idea is below), and if not, could this be fixed?
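One workaround I'm currently considering, sketched below, is to subclass the worker and push the combined history back through a token-limited ChatMemoryBuffer before it is returned. The subclass name and the 12000 token limit are placeholders, and the import path assumes the llama-index-agent-openai package:

from typing import List

from llama_index.agent.openai import OpenAIAgentWorker
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer


class PrunedOpenAIAgentWorker(OpenAIAgentWorker):
    """Hypothetical subclass that re-applies token-based pruning to the combined history."""

    def get_all_messages(self, task) -> List[ChatMessage]:
        all_messages = (
            self.prefix_messages
            + task.memory.get()
            + task.extra_state["new_memory"].get_all()
        )
        # Run the combined history through a token-limited buffer so the oldest
        # messages are dropped before the request is sent to the LLM.
        buffer = ChatMemoryBuffer.from_defaults(token_limit=12000)  # placeholder limit
        buffer.set(all_messages)
        return buffer.get()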

Thanks for any help in advance.

Version

0.10.42

Steps to Reproduce

Use the create-llama template and define your chat engine as follows:

import os

from fastapi import HTTPException
from llama_index.core.chat_engine.types import ChatMode

# get_index() and NoLLM() come from the create-llama app code and the PR linked below.


def get_chat_engine():
    system_prompt = os.getenv("SYSTEM_PROMPT")
    top_k = os.getenv("TOP_K", 3)

    index = get_index()
    if index is None:
        raise HTTPException(
            status_code=500,
            detail=str(
                "StorageContext is empty - call 'poetry run generate' to generate the storage first"
            ),
        )

    return index.as_chat_engine(
        similarity_top_k=int(top_k),
        system_prompt=system_prompt,
        chat_mode=ChatMode.OPENAI,
        response_synthesizer=NoLLM(),
    )

The NoLLM() response_synthesizer can be found here: https://github.com/run-llama/llama_index/pull/14439#pullrequestreview-2146507622
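The engine is then driven like any other chat engine; repeated questions grow the agent's memory until the combined history no longer fits into gpt-3.5-turbo's 16k context window (the question string below is only a placeholder):

chat_engine = get_chat_engine()
response = chat_engine.chat("Summarize the ingested documents.")
print(response)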

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 months ago

Hey @Laktus 👋, I'm here to assist you with any bugs, questions, or contributions. Just hang tight and I'll help you out while we wait for a human maintainer.

To address the issue with OpenAIAgentWorker's get_all_messages function exceeding the model's maximum context length, you can modify the function to include message pruning. This ensures that the total token count of the messages does not exceed the model's limit.

Here's an example of how you could modify the get_all_messages function, counting tokens with tiktoken and looking up the model's context window with openai_modelname_to_contextsize:

from typing import List

import tiktoken
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai.utils import openai_modelname_to_contextsize


def get_all_messages(model_name: str, messages: List[ChatMessage]) -> List[ChatMessage]:
    max_context_size = openai_modelname_to_contextsize(model_name)
    tokenizer = tiktoken.encoding_for_model(model_name)
    current_context_size = 0
    truncated_messages = []

    # Walk from newest to oldest so the most recent messages are kept.
    # (In practice you would also reserve room for the response and the function schemas.)
    for message in reversed(messages):
        message_size = len(tokenizer.encode(message.content or ""))

        if current_context_size + message_size > max_context_size:
            break

        truncated_messages.append(message)
        current_context_size += message_size

    return list(reversed(truncated_messages))

In this code:

  1. openai_modelname_to_contextsize is used to get the maximum context size for the given model.
  2. The messages are iterated in reverse order to prioritize the most recent messages.
  3. Each message's token count is computed with tiktoken and checked against the remaining context size.
  4. Iteration stops once adding another message would exceed the context size limit.
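For example, the pruned helper could be called like this (the toy history stands in for the combined prefix, task-memory, and new-memory messages):

history = [
    ChatMessage(role="user", content="hello"),
    ChatMessage(role="assistant", content="hi, how can I help?"),
]
pruned = get_all_messages("gpt-3.5-turbo", history)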

Additionally, you can use the PromptHelper class from the LlamaIndex framework to manage the prompt size and ensure it does not exceed the model's token limit. Here is a sketch of how PromptHelper could be used to truncate the message contents (the function name and the num_output / chunk_overlap_ratio values are illustrative):

from llama_index.core import PromptHelper
from llama_index.core.prompts import PromptTemplate

# (reuses List, ChatMessage, and openai_modelname_to_contextsize imported above)


def truncate_message_contents(model_name: str, messages: List[ChatMessage]) -> List[str]:
    max_context_size = openai_modelname_to_contextsize(model_name)
    prompt_helper = PromptHelper(
        context_window=max_context_size,
        num_output=256,
        chunk_overlap_ratio=0.1,
    )
    # truncate() splits the remaining context window evenly across the chunks
    # and cuts each message's content down to its share.
    return prompt_helper.truncate(
        prompt=PromptTemplate("{context_str}"),
        text_chunks=[m.content or "" for m in messages],
    )

This approach ensures that the total token count of the messages does not exceed the model's limit, preventing the context_length_exceeded error [1][2][3][4][5].
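A quick call to the PromptHelper sketch above, reusing the same toy history:

truncated_texts = truncate_message_contents("gpt-3.5-turbo", history)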

To continue talking to Dosu, mention @dosu.

Laktus commented 3 months ago

For example, the following configuration of the prefix messages, task memory, and task.extra_state["new_memory"] (the COUNTS line lists their respective token counts):

CUR_STEP_OUTPUT_NEXT_STEPS [TaskStep(task_id='7a2aa21e-fb95-4b74-929a-15bfa18a6952', step_id='f8fe2c5a-4498-4673-9958-1370b32513b3', input=None, step_state={}, next_steps={}, prev_steps={}, is_ready=True)]
CUR_STEP_OUTPUT_OUTPUT
COUNTS [P,TM,TE] 116 ,  7629 ,  8441

causes this error (116 + 7629 + 8441 = 16,186 content tokens, which together with the per-message formatting overhead and the 54 function tokens exceeds the 16,385-token limit):

<Task finished name='Task-42452' coro=<Dispatcher.span.<locals>.async_wrapper() done, defined at /Users/saltukkezer/Library/Caches/pypoetry/virtualenvs/app-WTiBYyDQ-py3.11/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:212> exception=BadRequestError('Error code: 400 - {\'error\': {\'message\': "This model\'s maximum context length is 16385 tokens. However, your messages resulted in 16411 tokens (16357 in the messages, 54 in the functions). Please reduce the length of the messages or functions.", \'type\': \'invalid_request_error\', \'param\': \'messages\', \'code\': \'context_length_exceeded\'}}')>

Laktus commented 3 months ago

The same problem occurs when the memory of the extra state goes over the 16k limit (using gpt-3.5-turbo), e.g. because the same tool is called multiple times.