Closed: moreno1123 closed this issue 2 months ago
I found a similar issue that might be related to the inconsistent response times you're experiencing. The issue titled "[Bug]: Streaming on REACT chat agent not working as expected" was closed by a pull request [1].
The inconsistent response times with RAG on OpenAIAgent, sometimes stopping before the function call and sometimes after, are likely due to the threading and event handling mechanisms used in the _get_stream_ai_response method. Here is the relevant code snippet:
def _get_stream_ai_response(
    self, task: Task, **llm_chat_kwargs: Any
) -> StreamingAgentChatResponse:
    chat_stream_response = StreamingAgentChatResponse(
        chat_stream=self._llm.stream_chat(**llm_chat_kwargs),
        sources=task.extra_state["sources"],
    )
    # Get the response in a separate thread so we can yield the response
    thread = Thread(
        target=chat_stream_response.write_response_to_history,
        args=(task.extra_state["new_memory"],),
    )
    thread.start()
    # Wait for the event to be set
    chat_stream_response._is_function_not_none_thread_event.wait()
    # If it is executing an openAI function, wait for the thread to finish
    if chat_stream_response._is_function:
        thread.join()
    # if it's false, return the answer (to stream)
    return chat_stream_response
The variability arises because the method's behavior changes based on the state of chat_stream_response._is_function. If _is_function is True, the method waits for the thread to complete, leading to longer response times. If _is_function is False, the method returns the response immediately, resulting in shorter response times [2][3][4][5][6].
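To make the two paths concrete, here is a small self-contained sketch of the same wait-on-event pattern. The WaitDemo class and its sleep timings are illustrative only and are not part of llama-index; the point is that the caller always waits for the first chunk to reveal whether a function call is happening, and only blocks for the full duration when it is.

import time
from threading import Event, Thread


class WaitDemo:
    """Toy stand-in for the streaming response object; not part of llama-index."""

    def __init__(self, is_function: bool) -> None:
        self._is_function = is_function
        self._is_function_known = Event()

    def write_response_to_history(self) -> None:
        time.sleep(0.2)  # first chunk arrives, revealing whether a tool is called
        self._is_function_known.set()
        # Simulate either executing a tool (slow) or just streaming text (fast).
        time.sleep(1.0 if self._is_function else 0.1)


def get_response(is_function: bool) -> float:
    start = time.perf_counter()
    demo = WaitDemo(is_function)
    thread = Thread(target=demo.write_response_to_history)
    thread.start()
    demo._is_function_known.wait()  # always wait for the decision
    if demo._is_function:
        thread.join()  # only the function-call path blocks for the whole duration
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"streaming-only path: {get_response(False):.2f}s")  # returns early, ~0.2s
    print(f"function-call path:  {get_response(True):.2f}s")   # waits for join, ~1.2s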
Additionally, the call_function and acall_function methods handle the execution of functions, which can also contribute to the variability in response times. These methods involve calling tools and handling their outputs, which can vary in execution time.
Here are the relevant snippets for call_function and acall_function:
def call_function(
    tools: List[BaseTool], fn_obj: Any, verbose: bool = False
) -> Tuple[ChatMessage, ToolOutput]:
    """Call a function and return the output as a string."""
    from openai.types.beta.threads.required_action_function_tool_call import Function

    fn_obj = cast(Function, fn_obj)
    # TMP: consolidate with other abstractions
    name = fn_obj.name
    arguments_str = fn_obj.arguments
    if verbose:
        print("=== Calling Function ===")
        print(f"Calling function: {name} with args: {arguments_str}")
    tool = get_function_by_name(tools, name)
    argument_dict = json.loads(arguments_str)
    output = tool(**argument_dict)
    if verbose:
        print(f"Got output: {output!s}")
        print("========================")
    return (
        ChatMessage(
            content=str(output),
            role=MessageRole.FUNCTION,
            additional_kwargs={
                "name": fn_obj.name,
            },
        ),
        output,
    )
async def acall_function(
    tools: List[BaseTool], fn_obj: Any, verbose: bool = False
) -> Tuple[ChatMessage, ToolOutput]:
    """Call an async function and return the output as a string."""
    from openai.types.beta.threads.required_action_function_tool_call import Function

    fn_obj = cast(Function, fn_obj)
    # TMP: consolidate with other abstractions
    name = fn_obj.name
    arguments_str = fn_obj.arguments
    if verbose:
        print("=== Calling Function ===")
        print(f"Calling function: {name} with args: {arguments_str}")
    tool = get_function_by_name(tools, name)
    argument_dict = json.loads(arguments_str)
    async_tool = adapt_to_async_tool(tool)
    output = await async_tool.acall(**argument_dict)
    if verbose:
        print(f"Got output: {output!s}")
        print("========================")
    return (
        ChatMessage(
            content=str(output),
            role=MessageRole.FUNCTION,
            additional_kwargs={
                "name": fn_obj.name,
            },
        ),
        output,
    )
These methods involve synchronous and asynchronous calls to tools, which can further contribute to the variability in response times based on the nature and complexity of the tool being called [2][3][4][5][6].
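One way to see how much of the latency comes from the tool itself, as opposed to the OpenAI round-trips, is to time the tool body before handing it to the agent. A minimal sketch, assuming the standard FunctionTool.from_defaults API; the search_docs function is a hypothetical placeholder for whatever your tool actually does:

import time
from functools import wraps

from llama_index.core.tools import FunctionTool


def timed(fn):
    """Print how long each tool invocation takes."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper


@timed
def search_docs(query: str) -> str:
    """Answer a question from the document index (placeholder body)."""
    # e.g. return str(query_engine.query(query))
    return "stub answer"


search_tool = FunctionTool.from_defaults(fn=search_docs)

If the printed tool times are stable while end-to-end times swing widely, the variability is coming from the LLM calls rather than the RAG step.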
To continue talking to Dosu, mention @dosu.
Responses depend mostly on OpenAI's side.
If I had to guess, the slower responses happen when the query engine tool gets used (you have a top-k of 4, not sure what the chunk size is).
Overall I've noticed 4o-mini isn't nearly as fast as 3.5-turbo (yet).
Chunk size is the default 1024. I've tried lowering the top-k to 2, but didn't see any major improvement :/ I thought so too, that it's on OpenAI... they're really not that consistent. Thanks for answering tho.
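For reference, a rough sketch of where those two knobs (chunk size and retriever top-k) live and how the tool feeds the agent, assuming llama-index 0.11.x / llama-index-agent-openai 0.3.x as reported below; the document contents, tool name, and description are placeholders:

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI

Settings.chunk_size = 512  # global default is 1024

documents = [Document(text="placeholder text, replace with your own data")]
index = VectorStoreIndex.from_documents(documents)

# Fewer retrieved chunks means a smaller function-call result for the LLM to read.
query_engine = index.as_query_engine(similarity_top_k=2)

docs_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="docs",
    description="Answer questions about the indexed documents.",
)

agent = OpenAIAgent.from_tools([docs_tool], llm=OpenAI(model="gpt-4o-mini"))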
Bug Description
Very inconsistent response times with RAG on OpenAIAgent. Sometimes it stops before the function call and sometimes after the function call but before the result.
MODEL=gpt-4o-mini
Version
llama-index==0.11.3 llama-index-agent-openai==0.3.0
Steps to Reproduce
Relevant Logs/Tracebacks
No response