run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Difference between `CondenseQuestionChatEngine` and `CondensePlusContextChatEngine` #16098

Closed RakeshReddyKondeti closed 3 days ago

RakeshReddyKondeti commented 4 days ago

Question

Hi,

As I understand it, the primary difference between CondenseQuestionChatEngine and CondensePlusContextChatEngine is that the former uses the query_engine to generate a response, while the latter generates responses directly from the context returned by the retriever. Apart from this, the general workflow is the same.

When the query_engine is involved, the user can choose the response_synthesis module, which is crucial (at least in my case). For example, if I ask a summarization question to CondensePlusContextChatEngine, all the retrieved text chunks are added to the system prompt, which can exceed the LLM's context window.
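
For reference, here is roughly how I use CondenseQuestionChatEngine today to control synthesis through the query engine (a minimal sketch; the ./data path and the tree_summarize response mode are just examples):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.chat_engine import CondenseQuestionChatEngine

# build an index over local documents (example data path)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# the query engine exposes the response synthesizer, e.g. tree_summarize,
# so large retrieved contexts are summarized in stages instead of being
# stuffed into a single prompt
query_engine = index.as_query_engine(response_mode="tree_summarize")

chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine)
response = chat_engine.chat("Can you summarize the documents?")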

Could you please suggest a way to handle response synthesis in CondensePlusContextChatEngine, similar to CondenseQuestionChatEngine, so that the context window is not exceeded?

Thank you!

dosubot[bot] commented 4 days ago

Correct: the primary difference between CondenseQuestionChatEngine and CondensePlusContextChatEngine is that the former sends the condensed question through the query_engine (and whatever response synthesizer it is configured with), while the latter generates the response directly from the context returned by the retriever.

To keep the prompt within the context window, CondensePlusContextChatEngine uses a TokenCounter to estimate the token count of the messages before assembling them. Here is the relevant code snippet:

# reserve tokens for the (context-stuffed) system prompt
initial_token_count = self._token_counter.estimate_tokens_in_messages(
    [system_message]
)

# store the user message, then fetch only as much chat history as still fits
self._memory.put(ChatMessage(content=message, role=MessageRole.USER))
chat_messages = [
    system_message,
    *self._memory.get(initial_token_count=initial_token_count),
]

This ensures that the system prompt's token count is taken into account when retrieving chat history from memory, so the assembled messages stay within budget. Additionally, the ChatMemoryBuffer is initialized with a token limit set to the LLM's context window minus a small buffer (256 tokens), which further constrains the total prompt size [1][2].
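
If you want to constrain this further when constructing the engine yourself, one option (a sketch, assuming you already have an index and llm) is to pass a ChatMemoryBuffer with a smaller token limit and a retriever that returns fewer chunks:

from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer

# `index` and `llm` are assumed to exist already
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)  # smaller than context_window - 256

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=2),  # fewer chunks end up in the system prompt
    llm=llm,
    memory=memory,
)
response = chat_engine.chat("Can you summarize the documents?")

Note that this does not add a response synthesizer; it only limits how much retrieved text and chat history end up in the prompt.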

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 days ago

https://github.com/run-llama/llama_index/pull/16017