Closed RakeshReddyKondeti closed 3 days ago
The primary difference between `CondenseQuestionChatEngine` and `CondensePlusContextChatEngine` is that the former uses a `query_engine` to generate a response, while the latter generates responses directly from the context retrieved by its retriever.
To handle response synthesis in `CondensePlusContextChatEngine` and avoid exceeding the context window, the engine uses a `TokenCounter` to estimate the token count of the messages. Here is the relevant code snippet:
```python
# Reserve room for the system prompt before pulling history from memory.
initial_token_count = self._token_counter.estimate_tokens_in_messages(
    [system_message]
)
# Record the new user message, then fetch only as much history as fits
# within the memory buffer's token limit minus the reserved count.
self._memory.put(ChatMessage(content=message, role=MessageRole.USER))
chat_messages = [
    system_message,
    *self._memory.get(initial_token_count=initial_token_count),
]
```
This ensures the system prompt's token count is accounted for when retrieving chat messages from memory, keeping the total within budget. Additionally, the `ChatMemoryBuffer` is initialized with a token limit set to the LLM's context window minus a buffer (256 tokens in this case), which further guards against overflowing the context window [1][2].
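As a rough illustration of that budgeting, here is a self-contained sketch. The numbers and the whitespace-based `estimate_tokens` are made up for demonstration; the real engine uses the LLM's tokenizer via `TokenCounter`, and `get_messages` stands in for `ChatMemoryBuffer.get(initial_token_count=...)`:

```python
# Hypothetical sizes, analogous to context_window - 256 in the engine.
CONTEXT_WINDOW = 25   # pretend context window, in tokens
BUFFER = 5            # reserve, analogous to the 256-token buffer
TOKEN_LIMIT = CONTEXT_WINDOW - BUFFER


def estimate_tokens(text: str) -> int:
    """Crude stand-in for TokenCounter.estimate_tokens_in_messages."""
    return len(text.split())


def get_messages(history: list[str], initial_token_count: int) -> list[str]:
    """Return the newest messages that fit in the remaining budget,
    mirroring ChatMemoryBuffer.get(initial_token_count=...)."""
    budget = TOKEN_LIMIT - initial_token_count
    kept: list[str] = []
    used = 0
    for msg in reversed(history):  # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                  # older messages are dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))    # restore chronological order


system_message = "You are a helpful assistant answering from context."
history = [
    "first question about the document",
    "a fairly long answer with many many words in it indeed",
    "second question",
]

# Reserve room for the system message before pulling from memory,
# just as the snippet above does with initial_token_count.
initial = estimate_tokens(system_message)
chat_messages = [system_message, *get_messages(history, initial)]
```

With these toy numbers, only the most recent message fits alongside the system prompt; the older exchange is silently dropped, which is exactly why a long list of retrieved chunks can crowd out history (or vice versa) when everything must share one token budget.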
Question
Hi,
As I understand it, the primary difference between `CondenseQuestionChatEngine` and `CondensePlusContextChatEngine` is that the former uses a `query_engine` to generate a response, while the latter generates responses directly from the context retrieved by its retriever. Apart from this, the general workflow is the same.

When the `query_engine` is involved, the user can choose the `response_synthesis` module, which is crucial (at least in my case). For example, if I ask `CondensePlusContextChatEngine` a summarization question, all the retrieved text chunks are added to the system prompt, which can exceed the LLM's context window.

Could you please suggest a way to handle response synthesis in `CondensePlusContextChatEngine`, similar to `CondenseQuestionChatEngine`, to avoid exceeding the context window?

Thank you!