microsoft / autogen

handle context size overflow in AssistantAgent #9

Closed: sonichi closed this issue 3 months ago

sonichi commented 1 year ago

microsoft/FLAML#1098, microsoft/FLAML#1153, and microsoft/FLAML#1158 each address this in a specialized way. Can we integrate these ideas into a generic solution and make AssistantAgent overcome this limitation out of the box?

### Tasks
- [ ] https://github.com/microsoft/FLAML/issues/1143
yiranwu0 commented 1 year ago

I will handle this problem in microsoft/FLAML#1153. The problem should be in generate_reply, when it returns extra-long messages. My current plan includes the following functionality:

  1. Use tiktoken for a more accurate token count, and add a static function that checks the tokens left given the model and the previous messages.
  2. Allow the user to pass in a predefined output limit.
  3. When the generated output (for example, from code execution) exceeds the maximum tokens allowed or the user-predefined limit, return a "long result" error.

@thinkall has implemented the tiktoken count in microsoft/FLAML#1158. Should I try to fix this concurrently?
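
For item 1, a minimal sketch of what a tiktoken-based count could look like (assuming tiktoken is installed; `count_tokens`, `tokens_left`, and the context-size table are illustrative names and numbers, not the actual FLAML/AutoGen implementation):

```python
import tiktoken

MAX_CONTEXT = {"gpt-3.5-turbo": 4096, "gpt-4": 8192}  # illustrative limits only

def count_tokens(messages, model="gpt-3.5-turbo"):
    """Roughly count the tokens used by a list of chat messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    # Rough approximation: content tokens plus a small per-message overhead.
    return sum(len(encoding.encode(m.get("content") or "")) + 4 for m in messages)

def tokens_left(messages, model="gpt-3.5-turbo"):
    """Estimate the tokens still available for a reply with this model."""
    return MAX_CONTEXT.get(model, 4096) - count_tokens(messages, model)
```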

sonichi commented 1 year ago

> I will handle this problem in microsoft/FLAML#1153. The problem should be in generate_reply, when it returns extra-long messages. My current plan includes the following functionality:
>
>   1. Use tiktoken for a more accurate token count, and add a static function that checks the tokens left given the model and the previous messages.
>   2. Allow the user to pass in a predefined output limit.
>   3. When the generated output (for example, from code execution) exceeds the maximum tokens allowed or the user-predefined limit, return a "long result" error.
>
> @thinkall has implemented the tiktoken count in microsoft/FLAML#1158. Should I try to fix this concurrently?

Your proposal solves part of the problem: it does the check on the sender's side when the receiver requests a length limit. There are other alternatives:

  1. The receiver requests that, when the msg is longer than the threshold, the sender send only part of the msg, and the two sides have a protocol to deal with the remaining part. microsoft/FLAML#1098 and microsoft/FLAML#1158 are examples of this (a rough sketch follows the list).
  2. The receiver doesn't request a check on the sender's side; it performs compression on the receiver's side. For example, it can employ the agents in microsoft/FLAML#1098 to do so. Even when the check on the sender is requested, some compression can still be done at the receiver's side to make room for future msgs.
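
A rough sketch of the chunking idea in alternative 1 (this is not how microsoft/FLAML#1098 or microsoft/FLAML#1158 implement it; `split_message` is a hypothetical helper):

```python
import tiktoken

def split_message(text, limit=1000, model="gpt-3.5-turbo"):
    """Split a long message into chunks of at most `limit` tokens each,
    so the sender can deliver them one at a time on request."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i : i + limit]) for i in range(0, len(tokens), limit)]
```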

It'll be good to figure out what we want to support and have a comprehensive design. Could you discuss with @thinkall and @LeoLjl? You are in the same time zone. Once you have a proposal, @qingyun-wu and I can go over it.

yiranwu0 commented 1 year ago

Sure, I will discuss it with @thinkall and @LeoLjl.

I just updated microsoft/FLAML#1153 to allow the user to set a predefined token limit for outputs from code or function calls. I think this is a different task from handling token_limit in oai_reply.

yiranwu0 commented 1 year ago

@sonichi @qingyun-wu Here is my proposed plan:

On AssistantAgent: add a parameter on_token_limit chosen from ["Terminate", "Compress"]. We would check whether the token limit is reached before oai.create is called. If set to "Terminate", we terminate the conversation; if set to "Compress", we use a compression agent to compress previous messages and prepare for future conversations (we could also set a threshold, such as 80% of the max tokens, at which to start an async compression agent). I read that OpenAI summarizes previous messages when the conversation gets too long.
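
A minimal sketch of the proposed on_token_limit behavior, assuming a tokens_left helper like the one sketched earlier; on_token_limit is the parameter proposed here, while _handle_token_limit and _compress_messages are hypothetical names:

```python
def _handle_token_limit(self, messages):
    """Hypothetical pre-check run before oai.create is called."""
    if tokens_left(messages, model=self.model) > 0:
        return messages  # within budget: proceed to oai.create as usual
    if self.on_token_limit == "Terminate":
        return None  # stop instead of sending an over-long request
    if self.on_token_limit == "Compress":
        # Ask a compression agent to summarize older messages, keeping the
        # most recent ones intact (compression agent details omitted).
        summary = self._compress_messages(messages[:-2])
        return [{"role": "system", "content": summary}] + messages[-2:]
    return messages
```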

On UserProxyAgent (I already added this in https://github.com/microsoft/FLAML/pull/1153): allow the user to specify auto_reply_token_limit, defaulting to -1 (no limit). When auto_reply_token_limit > 0 and the token count of an auto reply (code execution or function call) exceeds the limit, the output is replaced with an error message. This lets users prevent unexpected cases where the output from code execution or function calls overflows the context.
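
And the UserProxyAgent side, as a sketch only (auto_reply_token_limit is the setting described in the PR; the surrounding code and the reuse of count_tokens are illustrative):

```python
def _check_auto_reply_length(self, reply: str) -> str:
    """Hypothetical guard around an auto reply (code execution / function call)."""
    limit = getattr(self, "auto_reply_token_limit", -1)
    if limit > 0 and count_tokens([{"role": "user", "content": reply}]) > limit:
        return f"Error: the output exceeded the token limit of {limit} and was discarded."
    return reply
```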

With the two changes above, all three generate_reply cases are addressed: oai_reply, code execution, and function calls. I have general tasks like problem-solving in mind. @BeibinLi likes the "Compress" and "Terminate" approach.

For tasks that involve databases and consume a large number of tokens, such as answering questions over a long text or searching for data in a database, I think we need special designs targeting those applications.

sonichi commented 1 year ago

The proposal is a good start. I like that the design covers two options: dealing with the token limit after/before a reply is made. I think we can generalize this design:

  1. For each auto reply method, we add an optional argument token_limit to let the method know the token limit for each reply. Allow it to be either a user-specified constant or an auto-decided number. The method is responsible for handling that constraint. This includes the retrieval-based auto reply, such as the one in RetrieveChat.
  2. For oai_reply, we catch the token-limit error and return (False, None) when the error happens. That gives up the chance to finalize the reply and lets the next registered method decide it. Then we can register the compressor method to be processed after oai_reply yields (a rough sketch follows).
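
A sketch of point 2 under the registered-reply convention discussed here, where each method returns a (final, reply) tuple; TokenLimitError, _call_llm, and _compress are placeholders, not real AutoGen/FLAML APIs:

```python
class TokenLimitError(Exception):
    """Placeholder for whatever error the backend raises on context overflow."""

def generate_oai_reply(self, messages=None, sender=None, config=None):
    if messages is None:
        messages = self._oai_messages[sender]
    try:
        reply = self._call_llm(messages)  # placeholder for the oai.create call
    except TokenLimitError:
        return False, None  # not final: let the next registered method decide
    return True, reply

def generate_compressed_reply(self, messages=None, sender=None, config=None):
    """Registered to run after generate_oai_reply yields with (False, None)."""
    compressed = self._compress(messages)  # hypothetical compressor
    return True, self._call_llm(compressed)
```
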
yiranwu0 commented 1 year ago

On second thought, I don't think we need to pass a token_limit argument. Currently, for function and code execution, I use a class variable auto_reply_token_limit to customize the behavior when the limit is reached. When a new agent class overrides this behavior, it can reuse this variable or create a new class variable.

sonichi commented 1 year ago

Should the sender tell the receiver the token limit? "token_limit" and the way to handle token_limit should be separated: "token_limit" is a number that should be sent by the sender, and maybe we can make it a field in the message. The way to handle token_limit is decided in the auto reply method.
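
For illustration only, the suggested message field might look like this (the field name token_limit comes from the discussion above; nothing here reflects existing code):

```python
message = {
    "role": "user",
    "content": "Please summarize the attached logs.",
    "token_limit": 1024,  # proposed field; how to honor it is up to the reply method
}
```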

yiranwu0 commented 1 year ago

I have a few questions when looking at the code:

  1. In the receive function, generate_reply is called without passing in messages: self.generate_reply(sender=sender), so messages will be None. When a registered method such as generate_oai_reply is called, messages is None and it falls back to the pre-stored messages:

        if messages is None:
            messages = self._oai_messages[sender]

     It seems that this messages argument is not used. When would it be used? One possible usage: when generate_reply is called individually.

  2. The context argument passed to register_auto_reply seems more appropriate to rename to reply_config. In oai_reply it is converted to llm_config and in code execution it is converted to code_execution_config; in other reply methods it is not used. Also, "context" can be a field in a message from OAI, and "content" is a field in the message as well, which could cause confusion.
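
For context on question 1, calling generate_reply individually with explicit messages might look like this (hypothetical usage, assuming assistant and user_proxy agents have already been constructed):

```python
# Standalone call that bypasses the stored conversation history.
reply = assistant.generate_reply(
    messages=[{"role": "user", "content": "Plot NVDA and TSLA stock price change YTD."}],
    sender=user_proxy,
)
print(reply)
```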

sonichi commented 1 year ago

> I have a few questions when looking at the code:
>
>   1. In the receive function, generate_reply is called without passing in messages: self.generate_reply(sender=sender), so messages will be None. When a registered method such as generate_oai_reply is called, messages is None and it falls back to the pre-stored messages:
>
>         if messages is None:
>             messages = self._oai_messages[sender]
>
>      It seems that this messages argument is not used. When would it be used? One possible usage: when generate_reply is called individually.
>
>   2. The context argument passed to register_auto_reply seems more appropriate to rename to reply_config. In oai_reply it is converted to llm_config and in code execution it is converted to code_execution_config; in other reply methods it is not used. Also, "context" can be a field in a message from OAI, and "content" is a field in the message as well, which could cause confusion.

Good questions. Regarding 1, yes, messages will be used when generate_reply is called individually. We can revise the call in the receive function to pass messages, to avoid this confusion. Regarding 2, we can rename it to config if we want to avoid the confusion. One thing to note is that this variable can be updated in the reply function to maintain some state. I wanted to use it in other methods too but haven't done the refactoring. @ekzhu, is it OK to rename context to config in generate_reply()?
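
To make the rename concrete, a hypothetical registration with a config object that also carries state might look like the sketch below; register_auto_reply is the method discussed above, while TokenBudget, budgeted_reply, count_tokens, and the exact signature are illustrative assumptions:

```python
class TokenBudget:
    """Mutable per-registration state carried in `config`."""
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

def budgeted_reply(self, messages=None, sender=None, config=None):
    # `config` keeps state across calls, as noted above.
    config.used += count_tokens(messages or [])
    if config.used > config.limit:
        return True, "TERMINATE"  # final reply: stop the conversation
    return False, None  # not final: let later registered methods reply

# Hypothetical registration; whether the keyword is `context` or `config`
# is exactly the naming question being discussed.
assistant.register_auto_reply(UserProxyAgent, budgeted_reply, config=TokenBudget(limit=4000))
```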