run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: as_chat_engine.chat_stream() returns thoughts in stream #16217

Open nbahar-intel opened 4 days ago

nbahar-intel commented 4 days ago

Question Validation

Question

How can I use response_gen without getting the thoughts?

In my example:

When querying:

            chat_engine = QE().as_chat_engine(chat_history=prefix_messages, verbose=False, streaming=True, system_prompt=system_prompt)
            return chat_engine.stream_chat(prompt)

...
            st.write_stream(response.response_gen)
The stream then includes the agent's reasoning before the final answer, e.g.:

    The current language of the user is English. The user has greeted me twice, which might indicate they are waiting for a specific response or need assistance with something specific. Answer: Hello again, Nisan! How can I assist you today?

To clarify, I just want response_gen to stream only the answer.

dosubot[bot] commented 4 days ago

To use response_gen without getting the thoughts when streaming from a chat engine via stream_chat(), the internal _is_function flag needs to be False. This flag indicates whether the streamed response contains tool calls or other additional information.

Here is the relevant internal snippet showing how this is handled:

async def awrite_response_to_history(
    self,
    memory: BaseMemory,
) -> None:
    if self.achat_stream is None:
        raise ValueError(
            "achat_stream is None. Cannot asynchronously write to "
            "history without achat_stream."
        )

    try:
        final_text = ""
        async for chat in self.achat_stream:
            # _is_function is True while the LLM is emitting a tool/function call
            self._is_function = is_function(chat.message)
            self.aput_in_queue(chat.delta)
            final_text += chat.delta or ""
            if self._is_function is False:
                # plain-text response: unblock consumers waiting on this event
                self._is_function_false_event.set()
        if self._is_function is not None:
            # the stream yielded at least one chunk; persist the final chat message
            chat.message.content = final_text.strip()
            memory.put(chat.message)
    except Exception as e:
        logger.warning(f"Encountered exception writing response to history: {e}")
    self._is_done = True

    self._is_function_false_event.set()
    self._new_item_event.set()

@property
def response_gen(self) -> Generator[str, None, None]:
    # Drain deltas from the queue until the writer marks the stream as done.
    while not self._is_done or not self._queue.empty():
        try:
            delta = self._queue.get(block=False)
            self._unformatted_response += delta
            yield delta
        except queue.Empty:
            continue
    self.response = self._unformatted_response.strip()

async def async_response_gen(self) -> AsyncGenerator[str, None]:
    # Same as response_gen, but waits on an asyncio event instead of busy-polling.
    while not self._is_done or not self._aqueue.empty():
        if not self._aqueue.empty():
            delta = self._aqueue.get_nowait()
            self._unformatted_response += delta
            yield delta
        else:
            await self._new_item_event.wait()
            self._new_item_event.clear()
    self.response = self._unformatted_response.strip()

In this code, the _is_function flag is checked for each streamed chunk; when it is False, the _is_function_false_event is set, which signals that the chunk is plain response text rather than a tool call or other intermediate output [1][2].
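
If the thoughts still show up in the stream (as in the example above), one application-level workaround, assuming the ReAct agent's default "Thought: ... Answer: ..." output format, is to buffer the deltas until the "Answer:" marker appears and only pass along what follows. The answer_only helper below is a sketch, not part of the llama_index API:

import re
from typing import Generator, Iterable

def answer_only(stream: Iterable[str]) -> Generator[str, None, None]:
    # Hypothetical helper: buffer raw deltas and start yielding only after the
    # ReAct "Answer:" marker has been seen, dropping the preceding thoughts.
    # If the marker never appears, nothing is emitted.
    buffer = ""
    seen_answer = False
    for delta in stream:
        if seen_answer:
            yield delta
            continue
        buffer += delta
        match = re.search(r"Answer:\s*", buffer)
        if match:
            seen_answer = True
            yield buffer[match.end():]  # emit any text already received after the marker

# Usage with the snippet from the question:
# st.write_stream(answer_only(response.response_gen))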

To continue talking to Dosu, mention @dosu.

logan-markewich commented 4 days ago

Technically it shouldn't be streaming the thoughts, but the code is a little complex to debug: https://github.com/run-llama/llama_index/blob/faa96cefb93b07c22d1a512e575b4a884a565cd4/llama-index-core/llama_index/core/agent/react/step.py#L625

Maybe it's worth building an agent from scratch so that you have more control over the output; a rough sketch follows the links below.

General workflows walkthrough: https://docs.llamaindex.ai/en/stable/module_guides/workflow/#workflows

React Agent example: https://docs.llamaindex.ai/en/stable/examples/workflow/react_agent/

Function calling agent example with streaming: https://colab.research.google.com/drive/1wVCkvX7oQu1ZwrMSAyaJ8QyzHyfR0D_j?usp=sharing
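
As a starting point, here is a minimal sketch of a from-scratch streaming workflow where the application decides exactly which tokens reach the user. The ChatWorkflow and StreamEvent names and the OpenAI model choice are assumptions for illustration, not a fixed recipe:

import asyncio

from llama_index.core.llms import ChatMessage
from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)
from llama_index.llms.openai import OpenAI


class StreamEvent(Event):
    delta: str


class ChatWorkflow(Workflow):
    @step
    async def chat(self, ctx: Context, ev: StartEvent) -> StopEvent:
        llm = OpenAI(model="gpt-4o-mini")  # assumed model; use whatever LLM you have configured
        gen = await llm.astream_chat([ChatMessage(role="user", content=ev.query)])
        final_text = ""
        async for chunk in gen:
            delta = chunk.delta or ""
            final_text += delta
            # Only forward the tokens you actually want the user to see;
            # any filtering of thoughts/tool output would happen here.
            ctx.write_event_to_stream(StreamEvent(delta=delta))
        return StopEvent(result=final_text)


async def main() -> None:
    handler = ChatWorkflow(timeout=60).run(query="Hello!")
    async for ev in handler.stream_events():
        if isinstance(ev, StreamEvent):
            print(ev.delta, end="", flush=True)
    final = await handler
    print("\nFinal:", final)


if __name__ == "__main__":
    asyncio.run(main())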