run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Streaming on REACT chat agent not working as expected #8147

Closed gich2009 closed 8 months ago

gich2009 commented 9 months ago

Bug Description

The response from the REACT chat agent is not being streamed properly. The agent seems to return an additional inference block instead of the response from the completed REACT process.

Version

<=0.8.45.post1

Steps to Reproduce

Run a react chat agent and stream the output.
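Not the original test.py (which was not shared); the following is a minimal sketch of the kind of script that reproduces this, assuming the 0.8.x API and a placeholder ./data directory with a few documents:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# build a small index over local documents (the path is a placeholder)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# build a ReAct chat engine and stream a simple greeting
chat_engine = index.as_chat_engine(chat_mode="react", verbose=True)
streaming_response = chat_engine.stream_chat("Hi, how are you?")

# each token is printed on its own line, which is how the log below was produced
for token in streaming_response.response_gen:
    print(token)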

Relevant Logs/Tracebacks

$python3 test.py

Hi, how are you?

Response: Hello! I'm an AI teacher here to help you learn about the carbon project development process. How can I assist you today?

(The following was streamed token by token, one token per line; it is reconstructed here for readability. The stream begins mid-sentence, i.e. the leading token(s) are missing.)

 carbon project development process consists of several stages. These stages include:

1. Project Design: This is the initial stage where the project is planned and designed.

2. Project Validation: In this stage, the project design is reviewed and validated to ensure its feasibility and compliance with relevant standards.

3. Project Registration: Once the project design is validated, it needs to be registered with the appropriate authorities or certification bodies.

4. Project Implementation: This stage involves the actual implementation of the project activities according to the design and plan.

5. Project Monitoring: During this stage, the project activities are monitored to ensure they are being implemented as planned and to track the project's progress.

6. Project Verification: After the project activities are completed, an independent third party verifies the project's outcomes and emissions reductions.

7. Issuance: If the project meets the verification requirements, it is issued with carbon credits or offsets.

8. Sales and Retirement: The carbon credits or offsets can be sold to interested buyers, and the retired credits are permanently removed from circulation.

These stages are interconnected and form a sequential process for developing and implementing carbon projects.
dosubot[bot] commented 9 months ago

🤖

Hello,

Thank you for reporting this issue. Based on the information you've provided and the context from the LlamaIndex repository, it seems like the issue could be related to how the response is being generated and consumed in the stream_chat() method.

In the StreamingAgentChatResponse class, the response_gen and async_response_gen methods are responsible for generating the response. These methods use a queue to store the response tokens and yield them one by one. If the first one or two tokens are missing, it could be because the consumer of the response generator is not properly waiting for the tokens to be available in the queue. This could happen if the consumer is trying to get tokens from the queue before they are put in, resulting in missing tokens. You can find more details about this in a similar solved issue [Bug]: Incomplete Response when streaming from a chat engine.
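To illustrate the failure mode described here, below is a simplified, self-contained sketch (not the actual StreamingAgentChatResponse implementation) of a queue-backed token generator in which a consumer that reads before the producer has written anything loses the leading tokens:

import queue
import threading
import time

token_queue: "queue.Queue[str]" = queue.Queue()
done = threading.Event()

def producer() -> None:
    # simulate the LLM writer: a short delay before the first token arrives
    time.sleep(0.05)
    for tok in ["Hello", "!", " How", " can", " I", " help", "?"]:
        token_queue.put(tok)
    done.set()

def buggy_consumer():
    # only drains what is already in the queue; tokens written later are never seen
    while not token_queue.empty():
        yield token_queue.get_nowait()

def patient_consumer():
    # keeps polling until the producer signals completion AND the queue is drained
    while not done.is_set() or not token_queue.empty():
        try:
            yield token_queue.get(timeout=0.01)
        except queue.Empty:
            continue

threading.Thread(target=producer).start()
print(list(buggy_consumer()))    # very likely [] -- the leading tokens go "missing"
print(list(patient_consumer()))  # ['Hello', '!', ' How', ' can', ' I', ' help', '?']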

Additionally, the REACT chat agent might be returning an additional inference block instead of the response from the completed REACT process due to the way the parse method in the ReActOutputParser class is implemented. If the keyword "Thought:" is not found in the output, the method assumes that the agent is directly outputting the answer and returns a ResponseReasoningStep with the thought set to "I can answer without any tools." and the response set to the output. This could be the additional inference block that you are seeing. You can find more details about this in the llama_index/agent/react/output_parser.py file.
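For context, the fallback described above behaves roughly like this (a simplified sketch of the logic in llama_index/agent/react/output_parser.py, not the actual source):

def parse_react_output(output: str) -> dict:
    """Simplified sketch (not the real ReActOutputParser): if the LLM output
    contains no explicit reasoning step, treat the raw output as a direct answer."""
    if "Thought:" not in output:
        # Fallback path: assume the agent answered directly without tools.
        return {
            "thought": "I can answer without any tools.",
            "response": output,
            "is_done": True,
        }
    # Otherwise the Thought / Action / Action Input (or Answer) lines would be
    # parsed into the corresponding reasoning step here.
    raise NotImplementedError("full parsing omitted in this sketch")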

Unfortunately, I couldn't find any changes made to the stream_chat() method in the LlamaIndex codebase after version 0.8.42. It would be helpful if you could provide more details about how you are using the stream_chat() method and how the response generator is being consumed. This could help us identify the exact issue and suggest a more accurate solution.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nerdai commented 9 months ago

Thanks @gich2009 for submitting this. I would like to triage this by first trying to reproduce the issue that you're running into.

Could you please provide more details on how you're running the ReAct agent, or potentially just share test.py here?

Additionally, have you seen our module guide (notebook) for using ReAct with Query Engine? It's a bit more involved than what you're doing here, as it constructs the ReActAgent from a set of QueryEngineTools.
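For readers following along, the pattern in that guide is roughly the following. This is a hedged sketch against the 0.8.x API with a placeholder data path and tool name, not a copy of the notebook:

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI
from llama_index.tools import QueryEngineTool, ToolMetadata

# build an index over some local documents (the path is a placeholder)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# expose the index's query engine to the agent as a tool
tools = [
    QueryEngineTool(
        query_engine=index.as_query_engine(),
        metadata=ToolMetadata(
            name="docs_tool",
            description="Answers questions about the indexed documents.",
        ),
    )
]

agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4"), verbose=True)
response = agent.chat("What do the documents say?")
print(response)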

cwysong85 commented 9 months ago

I'm also running into the same issue. My code is as follows:

def index_chat():
    try:
        if request.method == 'POST':
            cnx = cnxpool.get_connection()
            cur = cnx.cursor()

            logging.info("Received a POST request to /index-chat")
            message: str = request.json['message']
            documentId: str = request.json['documentId']
            userId: str = request.json['userId']

            print(f"Received message: {message}")
            print(f"Received documentId: {documentId}")
            print(f"Received userId: {userId}")

            our_filters = MetadataFilters(filters=[ExactMatchFilter(key="documentId", value=documentId), ExactMatchFilter(key="userId", value=userId)])

            # get previous messages from DB and pass to stream_chat
            sql = "SELECT message, role FROM DocumentChat WHERE documentId = %s"
            val = (documentId, )
            cur.execute(sql, val)
            result = cur.fetchall()

            messages = []
            for row in result:
                if row[1] == "user":
                    messages.append(ChatMessage(role=MessageRole.USER, content=row[0]))
                else:
                    messages.append(ChatMessage(role=MessageRole.ASSISTANT, content=row[0]))

            # save to mysql DB
            sql = "INSERT INTO DocumentChat (message, documentId, role) VALUES (%s, %s, %s)"
            val = (message, documentId, "user")
            cur.execute(sql, val)
            cnx.commit()
            cnx.close()

            llm = OpenAI(temperature=0.1, model="gpt-4", api_key=openai.api_key)
            service_context = ServiceContext.from_defaults(llm=llm, callback_manager=callback_manager)

            vector_store = PineconeVectorStore(api_key=pinecone_api_key, index_name=pinecone_index, environment=pinecone_env, filters=our_filters)
            storage_context = StorageContext.from_defaults(vector_store=vector_store)

            vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store, storage_context=storage_context, service_context=service_context)

            chat_engine = vector_index.as_chat_engine(chat_mode="react", filters=our_filters, verbose=True)
            def event_stream():
                stream_response = chat_engine.stream_chat(message, chat_history=messages)
                for token in stream_response.response_gen:
                    print(f"Sending token: {token}")
                    yield token

                cnx = cnxpool.get_connection()
                cur = cnx.cursor()
                sql = "INSERT INTO DocumentChat (message, documentId, role) VALUES (%s, %s, %s)"
                val = (stream_response.response, documentId, "assistant")
                cur.execute(sql, val)
                cnx.commit()
                cnx.close()

            return Response(event_stream(), mimetype="text/event-stream")
    except Exception as e:
        logging.error(f"An error occurred: {e}")

I added a print statement to verify that the missing-token issue originates in the stream_response.response_gen generator itself.

nerdai commented 9 months ago

Thanks @cwysong85! Taking a closer look today to see what's happening here and how it can be resolved.

nerdai commented 9 months ago

Okay, I've been trying to replicate the bug and ran into something, though I don't know for sure if it's what users are experiencing here.

A couple of notes:

  1. If you want to see the ReAct lineage of Thought, Action, and Response, then you should pass in verbose=True in the as_chat_engine method call of the VectorStoreIndex.
  2. I have run into a potential BUG in stream_chat when getting the final response. This may or may not be related to the issue originally raised here.
nerdai commented 9 months ago

Here is the script that I've used to try to replicate the BUG experienced by @gich2009 and @cwysong85. Note: I am running this script in the root directory of the repo; access to .../examples/data/paul_graham is required.

react_example.py

import argparse

from dotenv import load_dotenv

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.indices.service_context import ServiceContext
from llama_index.llms import OpenAI

parser = argparse.ArgumentParser()
parser.add_argument(
    "-s", "--streaming", help="streaming or regular chat", action="store_true"
)

def main(streaming: bool = False):
    # set the LLM
    load_dotenv()
    llm = OpenAI(temperature=0.1, model="gpt-4")
    service_context = ServiceContext.from_defaults(llm=llm)

    # create a SimpleVectorStore
    documents = SimpleDirectoryReader("../docs/examples/data/paul_graham").load_data()
    vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

    # create ChatEngine
    chat_engine = vector_index.as_chat_engine(chat_mode="react", verbose=True)

    if not streaming:
        # regular chat
        message = "Hi, how are you?"
        response = chat_engine.chat(message=message)
    else:
        # stream chat
        message = "Hi, how are you?"
        response = chat_engine.stream_chat(message=message)
        response.print_response_stream()

    print("\n")
    print(f"final response:\n\n{response.response}")

if __name__ == "__main__":
    args = parser.parse_args()
    main(args.streaming)

To run react_example.py you need a .env

OPENAI_API_KEY=<fill-in>

You can run the script in either regular or streaming chat mode by passing the --streaming option (the default is regular mode).

For example, to run the script using stream_chat

python react_example.py --streaming

This yields

Response: Hello! I'm an AI assistant designed to help with various tasks. How can I assist you today?
'm doing well, thank you! How can I assist you today?

final response:

'm doing well, thank you! How can I assist you today?

This is clearly different from the ReAct Response output. When running in regular mode, the output matches the final response.

Executing python react_example.py gives the following output:

Response: Hello! I'm an AI assistant designed to help with various tasks. How can I assist you today?

final response:

Hello! I'm an AI assistant designed to help with various tasks. How can I assist you today?
nerdai commented 9 months ago

Alright, I added logging to the script above and observed that the ReActAgent is not terminating in stream_chat mode after a Response step is made. It makes an additional call to OpenAI with this chat context:

[
  ...
  {"role": "user", "content": "Hi, how are you?"}, 
  {"role": "assistant", "content": "Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?"} 
]
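The exact logging added above isn't shown; for anyone wanting to reproduce the observation, one way to capture the underlying LLM calls with LlamaIndex's own debug handler is sketched below (an assumption, not the method actually used):

from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.callbacks.schema import CBEventType

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# pass this callback_manager into ServiceContext.from_defaults(...) in the
# script above, then after the chat call inspect the raw LLM events:
for start_event, end_event in llama_debug.get_event_pairs(CBEventType.LLM):
    print(start_event.payload)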
cwysong85 commented 8 months ago

Alright, I added logging to the script above and observed that the ReActAgent is not terminating in stream_chat mode after a Response step is made. It makes an additional call to OpenAI with this chat context:

[
  ...
  {"role": "user", "content": "Hi, how are you?"}, 
  {"role": "assistant", "content": "Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?"} 
]

I've actually noticed this bug where the "Response" just kept looping the same content string over and over again to OpenAI until OpenAI rate limited the requests. It looked something like this:

Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?
Response: Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?
Response: Response: Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?
Response: Response: Response: Response: Hello! I\'m an AI assistant designed to help with various tasks. How can I assist you today?
etc etc

This bug would only occur in ReAct mode too. Is it possible that these two issues could be related?

nerdai commented 8 months ago

@cwysong85 Yes, I believe that they are related. After investigating this issue for some time, we found that it occurs due to a faulty check on the stream for when the final response (or reasoning step) is about to be sent. If that check returns a false negative (i.e., the agent has the answer, but we failed to classify it as such), then the ReAct agent will go through another iteration.

What's also happening here is that our desired outcome is to stream the final reasoning step, and not the entire ReAct thought/action/observation/... output. This is related, because we rely on the check of whether the stream is part of the final response, which signals the end of the ReAct execution. Ultimately, I believe what was happening here was a result of those two things.
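Conceptually, the kind of check involved looks something like this (a simplified sketch, not the actual agent code; the exact markers are an assumption):

def looks_like_final_answer(streamed_prefix: str) -> bool:
    # Heuristic over the first streamed chunks: if the model is not emitting
    # another Thought/Action step, treat this stream as the final answer and
    # end the ReAct loop. A false negative here means the agent runs one more
    # (unnecessary) LLM call -- the extra inference block seen in the logs above.
    return "Thought:" not in streamed_prefix and "Action:" not in streamed_prefix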

A PR is in now that should make this more consistent.

gich2009 commented 8 months ago

Thanks @nerdai for your help in solving this problem. I have been a bit unavailable to assist but I'm glad the issue is resolved. I will test out the fix and report any bugs found.

nerdai commented 8 months ago

No problem @gich2009. 🤝