run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Stream Responses of ReAct Agent Query Engine without intermediate steps #13226

Closed JulianOestreich90 closed 6 months ago

JulianOestreich90 commented 7 months ago

Question Validation

Question

I want to stream responses of a ReAct agent query engine with FastAPI, and so far I am doing the following:

from fastapi import APIRouter, Depends, Request
from fastapi.responses import StreamingResponse
from llama_index.core.agent.react.base import ReActAgent

chat_router = r = APIRouter()

async def chat_stream(chat_engine, input_text):
    response = chat_engine.stream_chat(input_text)
    for token in response.response_gen:
        yield token

@r.post("")
async def chat(
    request: Request,
    data: _ChatData = Depends(json_to_model(_ChatData)),
    agent: ReActAgent = Depends(get_agent),
):
    lastMessage = data.messages.pop()

    return StreamingResponse(
        chat_stream(agent, lastMessage.content), media_type="text/event-stream"
    )

My problem is that this returns all of the intermediate agent results in the response generator. How can I stream just the final answer?

dosubot[bot] commented 7 months ago

To stream only the final answer from the ReAct Query Engine in your FastAPI application, you'll need to adjust your streaming logic to filter out intermediate responses. The current implementation you've shared streams every piece of output generated by the chat_engine.stream_chat(input_text) method. If the ReAct Query Engine or the method you're using to interact with it provides a way to distinguish between intermediate and final responses, you should leverage that.

For instance, if there's a flag or a specific response pattern that indicates a final response, you can modify the chat_stream generator function to yield only when this condition is met. Unfortunately, without specific details on how the ReAct Query Engine differentiates between intermediate and final responses, I can't provide a precise code snippet.

However, conceptually, your chat_stream function would look something like this:

async def chat_stream(chat_engine, input_text):
    response = chat_engine.stream_chat(input_text)
    for token in response.response_gen:
        if token.is_final:  # Assuming `is_final` is a way to identify the final response
            yield token

This assumes response.response_gen yields objects that have an is_final attribute or similar. You'll need to replace is_final with the actual mechanism provided by your chat engine for identifying the final response.

Remember, the key is to identify how the ReAct Query Engine signals the end of a response sequence and use that signal to control what you yield from the chat_stream function. If the engine does not provide such a mechanism, you might need to implement additional logic based on the content of the responses or the sequence in which they are generated.
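
For example, llama_index's step-wise agent API (create_task / run_step) lets you drive the loop yourself and only send something once the last step has completed. The sketch below assumes those methods, finalize_response, and the is_last flag on the step output are available in your installed version, and it yields the final answer as a single chunk rather than token by token:

from llama_index.core.agent.react.base import ReActAgent

async def chat_stream(agent: ReActAgent, input_text: str):
    # Run the ReAct reasoning/tool steps without forwarding them to the client.
    task = agent.create_task(input_text)
    step_output = agent.run_step(task.task_id)
    while not step_output.is_last:
        step_output = agent.run_step(task.task_id)

    # Only the final answer is emitted.
    response = agent.finalize_response(task.task_id)
    yield str(response)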

#### Details

**Sources**

- [llama-index-core/llama_index/core/agent/react/output_parser.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/agent/react/output_parser.py)
- [llama-index-core/llama_index/core/agent/legacy/react/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/agent/legacy/react/base.py)
- [llama-index-integrations/agent/llama-index-agent-openai-legacy/llama_index/agent/openai_legacy/openai_agent.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/agent/llama-index-agent-openai-legacy/llama_index/agent/openai_legacy/openai_agent.py)
- [llama-index-legacy/llama_index/legacy/agent/legacy/openai_agent.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/agent/legacy/openai_agent.py)
- [llama-index-core/llama_index/core/agent/react/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/agent/react/base.py)
- [llama-index-legacy/llama_index/legacy/agent/react/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/agent/react/base.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

logan-markewich commented 7 months ago

@JulianOestreich90 What LLM are you using? I'm not actually able to reproduce this

>>> from llama_index.core import VectorStoreIndex, Document
>>> index = VectorStoreIndex.from_documents([Document.example()])
>>> chat_engine = index.as_chat_engine(chat_mode="react")
>>> response = chat_engine.stream_chat("Tell me a fact about LLMs?")
>>> for token in response.response_gen:
...   print(token, end="", flush=True)
... 
LLMs are pre-trained on large amounts of publicly available data, making them a powerful tool for knowledge generation and reasoning.

JulianOestreich90 commented 6 months ago

@logan-markewich I am using multi-document agents, as in the example from the documentation. I made some changes: I use a ReAct agent with Mistral-7B-Instruct-v0.2 on llama.cpp instead of an OpenAIAgent, and I pass the top_agent into the chat_stream() function together with the input text.
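
A minimal sketch of that kind of setup (placeholder model path, tool name, and description, not the configuration from this thread; the import path for LlamaCPP assumes the llama-index-llms-llama-cpp integration package):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.llama_cpp import LlamaCPP

# Placeholder model file -- point this at your own GGUF weights.
llm = LlamaCPP(model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")

# A single query engine tool for brevity; the multi-document agent pattern
# builds one tool per document and wraps them all in a top-level agent.
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(llm=llm),
    name="docs",
    description="Answers questions about the loaded documents.",
)

top_agent = ReActAgent.from_tools([tool], llm=llm, verbose=True)
response = top_agent.stream_chat("Give me one fact from the documents.")
for token in response.response_gen:
    print(token, end="", flush=True)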

JulianOestreich90 commented 6 months ago

The problem is solved. It only occurred for specific queries.

jp-kh-kim commented 4 months ago

@JulianOestreich90 Hi, I'm stuck on this problem. Can you share how you solved it?