run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Inconsistent response times from OpenAIAgent #15804

Closed · moreno1123 closed this issue 2 months ago

moreno1123 commented 2 months ago

Bug Description

Response times with RAG on OpenAIAgent are very inconsistent. Sometimes the stream stalls before the function call and sometimes after the function call but before the result comes back.

MODEL=gpt-4o-mini

[screenshot attached]

Version

llama-index==0.11.3
llama-index-agent-openai==0.3.0

Steps to Reproduce

import logging

import qdrant_client
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore

# qdrant_link, qdrant_api_key, vdb_collection_name, model and
# get_system_prompt() come from the application's config (not shown).

def get_agent(language: str = ""):
    logger = logging.getLogger("uvicorn")
    logger.info("Get chat agent")
    logger.info("Set Qdrant client")
    client = qdrant_client.QdrantClient(
        qdrant_link,
        api_key=qdrant_api_key,
        https=True,
        port=None
    )

    logger.info("Get Qdrant Vector Store")
    vector_store = QdrantVectorStore(client=client, collection_name=vdb_collection_name)
    index_basic_info = VectorStoreIndex.from_vector_store(
        vector_store=vector_store
    )
    query_engine_basic_info = index_basic_info.as_query_engine(similarity_top_k=4)

    query_engine_tools = [
        QueryEngineTool(
            query_engine=query_engine_basic_info,
            metadata=ToolMetadata(
                name="general_information",
                description=(
                    "Provides information about company."
                ),
            ),
        )
    ]

    tools = query_engine_tools

    logger.info("Set Open AI Agent")
    agent = OpenAIAgent.from_tools(
        tools=tools, 
        llm=OpenAI(model=model),
        verbose=True,
        system_prompt=get_system_prompt(language),
        max_function_calls=2
    )

    return agent


# Called per chat request; `agent` is built via get_agent(), and
# lastMessage/messages come from the endpoint handling the conversation.
response = agent.stream_chat(lastMessage.content, messages)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 months ago

I found a similar issue that might be related to the inconsistent response times you're experiencing. The issue titled "[Bug]: Streaming on REACT chat agent not working as expected" was closed by a pull request [1].

The inconsistent response times with RAG on OpenAIAgent, sometimes stopping before the function call and sometimes after, are likely due to the threading and event handling mechanisms used in the _get_stream_ai_response method. Here is the relevant code snippet:

def _get_stream_ai_response(
    self, task: Task, **llm_chat_kwargs: Any
) -> StreamingAgentChatResponse:
    chat_stream_response = StreamingAgentChatResponse(
        chat_stream=self._llm.stream_chat(**llm_chat_kwargs),
        sources=task.extra_state["sources"],
    )
    # Get the response in a separate thread so we can yield the response
    thread = Thread(
        target=chat_stream_response.write_response_to_history,
        args=(task.extra_state["new_memory"],),
    )
    thread.start()
    # Wait for the event to be set
    chat_stream_response._is_function_not_none_thread_event.wait()
    # If it is executing an openAI function, wait for the thread to finish
    if chat_stream_response._is_function:
        thread.join()

    # if it's false, return the answer (to stream)
    return chat_stream_response

The variability arises because the method's behavior changes based on the state of chat_stream_response._is_function. If _is_function is True, the method waits for the thread to complete, leading to longer response times. If _is_function is False, the method returns the response immediately, resulting in shorter response times [2][3][4][5][6].
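As a quick sanity check, here is a minimal sketch (not from the repo, and assuming the get_agent() helper from the reproduction above) that times each stream_chat turn and inspects response.sources, so you can confirm the slow turns are the ones where a tool actually ran:

import time

agent = get_agent("en")  # helper from the reproduction above

for question in ["Hi there", "What does the company do?"]:
    start = time.perf_counter()
    response = agent.stream_chat(question)
    answer = "".join(response.response_gen)  # drain the token stream
    elapsed = time.perf_counter() - start
    # response.sources is empty when the agent answered without calling a tool
    print(f"tool_calls={len(response.sources)}  elapsed={elapsed:.2f}s")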

Additionally, the call_function and acall_function methods handle the execution of functions, which can also contribute to the variability in response times. These methods involve calling tools and handling their outputs, which can vary in execution time.

Here are the relevant snippets for call_function and acall_function:

def call_function(
    tools: List[BaseTool], fn_obj: Any, verbose: bool = False
) -> Tuple[ChatMessage, ToolOutput]:
    """Call a function and return the output as a string."""
    from openai.types.beta.threads.required_action_function_tool_call import Function

    fn_obj = cast(Function, fn_obj)
    # TMP: consolidate with other abstractions
    name = fn_obj.name
    arguments_str = fn_obj.arguments
    if verbose:
        print("=== Calling Function ===")
        print(f"Calling function: {name} with args: {arguments_str}")
    tool = get_function_by_name(tools, name)
    argument_dict = json.loads(arguments_str)
    output = tool(**argument_dict)
    if verbose:
        print(f"Got output: {output!s}")
        print("========================")
    return (
        ChatMessage(
            content=str(output),
            role=MessageRole.FUNCTION,
            additional_kwargs={
                "name": fn_obj.name,
            },
        ),
        output,
    )

async def acall_function(
    tools: List[BaseTool], fn_obj: Any, verbose: bool = False
) -> Tuple[ChatMessage, ToolOutput]:
    """Call an async function and return the output as a string."""
    from openai.types.beta.threads.required_action_function_tool_call import Function

    fn_obj = cast(Function, fn_obj)
    # TMP: consolidate with other abstractions
    name = fn_obj.name
    arguments_str = fn_obj.arguments
    if verbose:
        print("=== Calling Function ===")
        print(f"Calling function: {name} with args: {arguments_str}")
    tool = get_function_by_name(tools, name)
    argument_dict = json.loads(arguments_str)
    async_tool = adapt_to_async_tool(tool)
    output = await async_tool.acall(**argument_dict)
    if verbose:
        print(f"Got output: {output!s}")
        print("========================")
    return (
        ChatMessage(
            content=str(output),
            role=MessageRole.FUNCTION,
            additional_kwargs={
                "name": fn_obj.name,
            },
        ),
        output,
    )

These methods involve synchronous and asynchronous calls to tools, which can further contribute to the variability in response times based on the nature and complexity of the tool being called [2][3][4][5][6].
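To isolate that part, a small sketch that calls the query engine tool directly, the same way call_function does, and times it (it reuses the query_engine_tools list from the reproduction above, which is built inside get_agent(), so run it where that list is in scope):

import time

tool = query_engine_tools[0]  # the "general_information" QueryEngineTool

start = time.perf_counter()
output = tool(input="What does the company do?")  # retrieval + LLM synthesis
print(f"tool latency: {time.perf_counter() - start:.2f}s")
print(output.content)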

To continue talking to Dosu, mention @dosu.

logan-markewich commented 2 months ago

Response times depend on what the agent ends up doing for a given message.

If I had to guess, the slower responses are the turns where the query engine tool gets used (you have a top k of 4, not sure what the chunk size is).
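For reference, a rough sketch of where those two knobs live (names taken from the reproduction snippet above): similarity_top_k is set when the query engine is built, while chunk size only takes effect when documents are ingested, so changing it means re-indexing into Qdrant.

from llama_index.core import Settings

Settings.chunk_size = 512  # default is 1024; applies when documents are (re)indexed
query_engine_basic_info = index_basic_info.as_query_engine(similarity_top_k=2)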

logan-markewich commented 2 months ago

Overall I've noticed 4o-mini isn't nearly as fast as 3.5-turbo (yet)

moreno1123 commented 2 months ago

Chunk size is the default 1024. I've tried lowering the top k to 2, but didn't see any major improvement :/ I thought so too, that it's on OpenAI's side... they're not really that consistent. Thanks for answering tho.